compilation 0368-3133 lecture 1: introduction lexical analysis noam rinetzky 1

148
Compilation 0368-3133 Lecture 1: Introduction Lexical Analysis Noam Rinetzky 1

Upload: theodora-ryan

Post on 02-Jan-2016

233 views

Category:

Documents


4 download

TRANSCRIPT

1

Compilation 0368-3133

Lecture 1:Introduction

Lexical Analysis

Noam Rinetzky

2

3

Admin

• Lecturer: Noam Rinetzky– [email protected]– http://www.cs.tau.ac.il/~maon

• T.A.: Orr Tamir

• Textbooks: – Modern Compiler Design – Compilers: principles, techniques and tools

4

Admin• Compiler Project 40%

– 4.5 practical exercises– Groups of 3

• 1 theoretical exercise 10%– Groups of 1

• Final exam 50% – must pass

5

Course Goals

• What is a compiler• How does it work• (Reusable) techniques & tools

6

Course Goals

• What is a compiler• How does it work• (Reusable) techniques & tools

• Programming language implementation– runtime systems

• Execution environments– Assembly, linkers, loaders, OS

7

Introduction

Compilers: principles, techniques and tools

8

What is a Compiler?

9

What is a Compiler?

“A compiler is a computer program that transforms source code written in a programming language (source language) into another language (target language).

The most common reason for wanting to transform source code is to create an executable program.”

--Wikipedia

10

What is a Compiler?source language target language

Compiler

Executable

code

exe

Source

text

txt

11

What is a Compiler?

Executable

code

exe

Source

text

txt

Compiler

int a, b;a = 2;b = a*2 + 1;

MOV R1,2SAL R1INC R1MOV R2,R1

12

What is a Compiler?source language target language

CC++

PascalJava

PostscriptTeX

PerlJavaScript

PythonRuby

Prolog

LispScheme

MLOCaml

IA32IA64

SPARC

CC++

PascalJava

Java Bytecode…

Compiler

13

High Level Programming Languages• Imperative Algol, PL1, Fortran, Pascal, Ada, Modula, C

– Closely related to “von Neumann” Computers• Object-oriented Simula, Smalltalk, Modula3, C++, Java,

C#, Python– Data abstraction and ‘evolutionary’ form of program

development• Class an implementation of an abstract data type (data+code)• Objects Instances of a class• Inheritance + generics

• Functional Lisp, Scheme, ML, Miranda, Hope, Haskel, OCaml, F#

• Logic Programming Prolog

14

More Languages• Hardware description languages VHDL

– The program describes Hardware components– The compiler generates hardware layouts

• Graphics and Text processing TeX, LaTeX, postscript– The compiler generates page layouts

• Scripting languages Shell, C-shell, Perl– Include primitives constructs from the current software

environment• Web/Internet HTML, Telescript, JAVA, Javascript • Intermediate-languages Java bytecode, IDL

15

High Level Prog. Lang., Why?

16

High Level Prog. Lang., Why?

17

Compiler vs. Interpreter

18

Compiler

• A program which transforms programs • Input a program (P)• Output an object program (O)

– For any x, “O(x)” “=“ “P(x)”

Compiler

Source

text

txt

Executable

code

exe

P O

19

Compiling C to Assembly

Compiler

int x;scanf(“%d”, &x);x = x + 1 ;printf(“%d”, x);

add %fp,-8, %l1mov %l1, %o1call scanfld [%fp-8],%l0add %l0,1,%l0st %l0,[%fp-8]ld [%fp-8], %l1mov %l1, %o1call printf

5

6

20

Interpreter

• A program which executes a program• Input a program (P) + its input (x)• Output the computed output (P(x))

Interpreter

Source

text

txt

Input

Output

21

Interpreting (running) .py programs

• A program which executes a program• Input a program (P) + its input (x)• Output the computed output (“P(x)”)

Interpreter

5

int x;scanf(“%d”, &x);x = x + 1 ;printf(“%d”, x);

6

22

Compiler vs. InterpreterSource

Code

Executable

Code Machine

Source

Code

Intermediate

Code Interpreter

preprocessing

processingpreprocessing

processing

23

Compiled programs are usually more efficient than

scanf(“%d”,&x);y = 5 ;z = 7 ;x = x + y * z;printf(“%d”,x);

add %fp,-8, %l1mov %l1, %o1call scanfmov 5, %l0st %l0,[%fp-12]mov 7,%l0st %l0,[%fp-16]ld [%fp-8], %l0ld [%fp-8],%l0add %l0, 35 ,%l0st %l0,[%fp-8]ld [%fp-8], %l1mov %l1, %o1 call printf

Compiler

24

Compilers report input-independent possible errors• Input-program

• Compiler-Output – “line 88: x may be used before set''

scanf(“%d”, &y);if (y < 0)

x = 5;...If (y <= 0)

z = x + 1;

25

Interpreters report input-specific definite errors

• Input-program

• Input data – y = -1– y = 0

scanf(“%d”, &y);if (y < 0)

x = 5;...If (y <= 0)

z = x + 1;

26

Interpreter vs. Compiler

• Conceptually simpler – “define” the prog. lang.

• Can provide more specific error report

• Easier to port

• Faster response time

• [More secure]

• How do we know the translation is correct?

• Can report errors before input is given

• More efficient code– Compilation can be expensive – move computations to

compile-time• compile-time + execution-time <

interpretation-time is possible

27

Concluding Remarks

• Both compilers and interpreters are programs written in high level language

• Compilers and interpreters share functionality

• In this course we focus on compilers

28

Ex 0: A Simple Interpreter

29

Toy compiler/interpreter

• Trivial programming language• Stack machine• Compiler/interpreter written in C• Demonstrate the basic steps

• Textbook: Modern Compiler Design 1.2

30

Conceptual Structure of a Compiler

Executable

code

exe

Source

text

txt

Semantic

Representation

Backend

(synthesis)

Compiler

Frontend

(analysis)

LexicalAnalysi

s

Syntax Analysi

s

Parsing

Semantic

Analysis

IntermediateRepresentati

on

(IR)

Code

Generation

31

Structure of toy Compiler / interpreter

Executable code

exe

Source

text

txt

Semantic

Representation

Backend (synthesi

s)Frontend

(analysis)

LexicalAnalysi

s

Syntax Analysi

s

Parsing

Semantic

Analysis

(NOP)

IntermediateRepresentati

on

(AST)

Code

Generation

Execution

Engine

Execution Engine Output*

* Programs in our PL do not take input

32

Source Language

• Fully parameterized expressions• Arguments can be a single digit

(4 + (3 * 9))✗3 + 4 + 5✗(12 + 3)

expression digit | ‘(‘ expression operator expression ‘)’operator ‘+’ | ‘*’digit ‘0’ | ‘1’ | ‘2’ | ‘3’ | ‘4’ | ‘5’ | ‘6’ | ‘7’ | ‘8’ | ‘9’

33

The abstract syntax tree (AST)

• Intermediate program representation• Defines a tree

– Preserves program hierarchy• Generated by the parser• Keywords and punctuation symbols are not

stored – Not relevant once the tree exists

34

Concrete syntax tree# for 5*(a+b)

expression

number expression‘*’

identifier

expression‘(’ ‘)’

‘+’ identifier

‘a’ ‘b’

‘5’

#Parse tree

35

Abstract Syntax tree for 5*(a+b)

‘*’

‘+’

‘a’ ‘b’

‘5’

36

Annotated Abstract Syntax tree

‘*’

‘+’

‘a’ ‘b’

‘5’

type:real

loc: reg1

type:real

loc: reg2

type:real

loc: sp+8

type:real

loc: sp+24

type:integer

37

Driver for the toy compiler/interpreter

#include "parser.h" /* for type AST_node */#include "backend.h" /* for Process() */#include "error.h" /* for Error() */

int main(void) { AST_node *icode;

if (!Parse_program(&icode)) Error("No top-level expression"); Process(icode);

return 0;}

38

Structure of toy Compiler / interpreter

Executable code

exe

Source

text

txt

Semantic

Representation

Backend (synthesi

s)Frontend

(analysis)

LexicalAnalysi

s

Syntax Analysi

s

Parsing

Semantic

Analysis

(NOP)

IntermediateRepresentati

on

(AST)

Code

Generation

Execution

Engine

Execution Engine Output*

* Programs in our PL do not take input

39

Lexical Analysis

• Partitions the inputs into tokens– DIGIT– EOF– ‘*’– ‘+’– ‘(‘– ‘)’

• Each token has its representation• Ignores whitespaces

40

lex.h: Header File for Lexical Analysis

/* Define class constants */

/* Values 0-255 are reserved for ASCII characters */

#define EoF 256

#define DIGIT 257

typedef struct {

int class;

char repr;} Token_type;

extern Token_type Token;

extern void get_next_token(void);

41

#include "lex.h" token_type Token; // Global variable

void get_next_token(void) { int ch; do { ch = getchar(); if (ch < 0) { Token.class = EoF; Token.repr = '#'; return; } } while (Layout_char(ch)); if ('0' <= ch && ch <= '9') {Token.class = DIGIT;} else {Token.class = ch;} Token.repr = ch;}

static int Layout_char(int ch) { switch (ch) { case ' ': case '\t': case '\n': return 1; default: return 0; }}

Lexical Analyzer

42

Structure of toy Compiler / interpreter

Executable code

exe

Source

text

txt

Semantic

Representation

Backend (synthesi

s)Frontend

(analysis)

LexicalAnalysi

s

Syntax Analysi

s

Parsing

Semantic

Analysis

(NOP)

IntermediateRepresentati

on

(AST)

Code

Generation

Execution

Engine

Execution Engine Output*

* Programs in our PL do not take input

43

Parser

• Invokes lexical analyzer• Reports syntax errors• Constructs AST

44

Parser Header File

typedef int Operator;

typedef struct _expression {

char type; /* 'D' or 'P' */

int value; /* for 'D' type expression */

struct _expression *left, *right; /* for 'P' type expression */

Operator oper; /* for 'P' type expression */

} Expression;

typedef Expression AST_node; /* the top node is an Expression */

extern int Parse_program(AST_node **);

45

AST for (2 * ((3*4)+9))P

*oper

typeleft right

P

+

P

*

D

2

D

9

D

4

D

3

47

AST for (2 * ((3*4)+9))P

*oper

typeleft right

P

+

P

*

D

2

D

9

D

4

D

3

48

Driver for the Toy Compiler

#include "parser.h" /* for type AST_node */#include "backend.h" /* for Process() */#include "error.h" /* for Error() */

int main(void) { AST_node *icode;

if (!Parse_program(&icode)) Error("No top-level expression"); Process(icode);

return 0;}

49

Source Language

• Fully parenthesized expressions• Arguments can be a single digit

(4 + (3 * 9))✗3 + 4 + 5✗(12 + 3)

expression digit | ‘(‘ expression operator expression ‘)’operator ‘+’ | ‘*’digit ‘0’ | ‘1’ | ‘2’ | ‘3’ | ‘4’ | ‘5’ | ‘6’ | ‘7’ | ‘8’ | ‘9’

50

lex.h: Header File for Lexical Analysis

/* Define class constants */

/* Integers are used to encode characters + special codes */

/* Values 0-255 are reserved for ASCII characters */

#define EoF 256

#define DIGIT 257

typedef struct {

int class;

char repr;} Token_type;

extern Token_type Token;

extern void get_next_token(void);

51

#include "lex.h" token_type Token; // Global variable

void get_next_token(void) { int ch; do { ch = getchar(); if (ch < 0) { Token.class = EoF; Token.repr = '#’; return;} } while (Layout_char(ch));

if ('0' <= ch && ch <= '9') Token.class = DIGIT;

else Token.class = ch;

Token.repr = ch;}

static int Layout_char(int ch) { switch (ch) { case ' ': case '\t': case '\n': return 1; default: return 0; }}

Lexical Analyzer

52

AST for (2 * ((3*4)+9))P

*oper

typeleft right

P

+

P

*

D

2

D

9

D

4

D

3

53

Driver for the Toy Compiler

#include "parser.h" /* for type AST_node */#include "backend.h" /* for Process() */#include "error.h" /* for Error() */

int main(void) { AST_node *icode;

if (!Parse_program(&icode)) Error("No top-level expression"); Process(icode);

return 0;}

54

Parser Environment#include "lex.h”, "error.h”, "parser.h"

static Expression *new_expression(void) { return (Expression *)malloc(sizeof (Expression));}

static int Parse_operator(Operator *oper_p);static int Parse_expression(Expression **expr_p);int Parse_program(AST_node **icode_p) { Expression *expr; get_next_token(); /* start the lexical analyzer */ if (Parse_expression(&expr)) { if (Token.class != EoF) { Error("Garbage after end of program"); } *icode_p = expr; return 1; } return 0;}

55

Top-Down Parsing• Optimistically build the tree from the root to leaves• For every P A1 A2 … An | B1 B2 … Bm

– If A1 succeeds• If A2 succeeds & A3 succeeds & …• Else fail

– Else if B1 succeeds• If B2 succeeds & B3 succeeds & ..• Else fail

– Else fail

• Recursive descent parsing– Simplified: no backtracking

• Can be applied for certain grammars

56

static int Parse_expression(Expression **expr_p) { Expression *expr = *expr_p = new_expression(); if (Token.class == DIGIT) { expr->type = 'D'; expr->value = Token.repr - '0'; get_next_token(); return 1; } if (Token.class == '(') { expr->type = 'P'; get_next_token(); if (!Parse_expression(&expr->left)) { Error("Missing expression"); } if (!Parse_operator(&expr->oper)) { Error("Missing operator"); } if (!Parse_expression(&expr->right)) { Error("Missing expression"); } if (Token.class != ')') { Error("Missing )"); } get_next_token(); return 1; } /* failed on both attempts */ free_expression(expr); return 0;}

Parser

static int Parse_operator(Operator *oper) { if (Token.class == '+') { *oper = '+'; get_next_token(); return 1; } if (Token.class == '*') { *oper = '*'; get_next_token(); return 1; } return 0;}

57

AST for (2 * ((3*4)+9))

P

*oper

typeleft right

P

+

P

*

D

2

D

9

D

4

D

3

58

Structure of toy Compiler / interpreter

Executable code

exe

Source

text

txt

Semantic

Representation

Backend (synthesi

s)Frontend

(analysis)

LexicalAnalysi

s

Syntax Analysi

s

Parsing

Semantic

Analysis

(NOP)

IntermediateRepresentati

on

(AST)

Code

Generation

Execution

Engine

Execution Engine Output*

* Programs in our PL do not take input

59

Semantic Analysis

• Trivial in our case• No identifiers• No procedure / functions• A single type for all expressions

60

Structure of toy Compiler / interpreter

Executable code

exe

Source

text

txt

Semantic

Representation

Backend (synthesi

s)Frontend

(analysis)

LexicalAnalysi

s

Syntax Analysi

s

Parsing

Semantic

Analysis

(NOP)

IntermediateRepresentati

on

(AST)

Code

Generation

Execution

Engine

Execution Engine Output*

* Programs in our PL do not take input

61

Intermediate Representation

P

*oper

typeleft right

P

+

P

*

D

2

D

9

D

4

D

3

62

Alternative IR: 3-Address Code

L1:_t0=a_t1=b_t2=_t0*_t1_t3=d_t4=_t2-_t3GOTO L1

“Simple Basic-like programming language”

63

Structure of toy Compiler / interpreter

Executable code

exe

Source

text

txt

Semantic

Representation

Backend (synthesi

s)Frontend

(analysis)

LexicalAnalysi

s

Syntax Analysi

s

Parsing

Semantic

Analysis

(NOP)

IntermediateRepresentati

on

(AST)

Code

Generation

Execution

Engine

Execution Engine Output*

* Programs in our PL do not take input

64

Code generation

• Stack based machine• Four instructions

– PUSH n– ADD– MULT– PRINT

65

Code generation#include "parser.h" #include "backend.h" static void Code_gen_expression(Expression *expr) { switch (expr->type) { case 'D': printf("PUSH %d\n", expr->value); break; case 'P': Code_gen_expression(expr->left); Code_gen_expression(expr->right); switch (expr->oper) { case '+': printf("ADD\n"); break; case '*': printf("MULT\n"); break; } break; }}void Process(AST_node *icode) { Code_gen_expression(icode); printf("PRINT\n");}

66

Compiling (2*((3*4)+9))

PUSH 2

PUSH 3

PUSH 4

MULT

PUSH 9

ADD

MULT

PRINT

P

*oper

typeleft right

P

+

P

*

D

2

D

9

D

4

D

3

67

Executing Compiled Program

Executable code

exe

Source

text

txt

Semantic

Representation

Backend (synthesi

s)Frontend

(analysis)

LexicalAnalysi

s

Syntax Analysi

s

Parsing

Semantic

Analysis

(NOP)

IntermediateRepresentati

on

(AST)

Code

Generation

Execution

Engine

Execution Engine Output*

* Programs in our PL do not take input

68

Generated Code Execution

PUSH 2

PUSH 3

PUSH 4

MULT

PUSH 9

ADD

MULT

PRINT

Stack Stack’

2

69

Generated Code Execution

PUSH 2

PUSH 3

PUSH 4

MULT

PUSH 9

ADD

MULT

PRINT

Stack’

3

2

Stack

2

70

Generated Code Execution

PUSH 2

PUSH 3

PUSH 4

MULT

PUSH 9

ADD

MULT

PRINT

Stack’

4

3

2

Stack

3

2

71

Generated Code Execution

PUSH 2

PUSH 3

PUSH 4

MULT

PUSH 9

ADD

MULT

PRINT

Stack’

12

2

Stack

4

3

2

72

Generated Code Execution

PUSH 2

PUSH 3

PUSH 4

MULT

PUSH 9

ADD

MULT

PRINT

Stack’

9

12

2

Stack

12

2

73

Generated Code Execution

PUSH 2

PUSH 3

PUSH 4

MULT

PUSH 9

ADD

MULT

PRINT

Stack’

21

2

Stack

9

12

2

74

Generated Code Execution

PUSH 2

PUSH 3

PUSH 4

MULT

PUSH 9

ADD

MULT

PRINT

Stack’

42

Stack

21

2

75

Generated Code Execution

PUSH 2

PUSH 3

PUSH 4

MULT

PUSH 9

ADD

MULT

PRINT

Stack’Stack

42

76

Shortcuts

• Avoid generating machine code• Use local assembler• Generate C code

77

Structure of toy Compiler / interpreter

Executable code

exe

Source

text

txt

Semantic

Representation

Backend (synthesi

s)Frontend

(analysis)

LexicalAnalysi

s

Syntax Analysi

s

Parsing

Semantic

Analysis

(NOP)

IntermediateRepresentati

on

(AST)

Code

Generation

Execution

Engine

Execution Engine Output*

* Programs in our PL do not take input

78

Interpretation

• Bottom-up evaluation of expressions• The same interface of the compiler

79

#include "parser.h" #include "backend.h”

static int Interpret_expression(Expression *expr) { switch (expr->type) { case 'D': return expr->value; break; case 'P': int e_left = Interpret_expression(expr->left); int e_right = Interpret_expression(expr->right); switch (expr->oper) { case '+': return e_left + e_right; case '*': return e_left * e_right; break; }}

void Process(AST_node *icode) { printf("%d\n", Interpret_expression(icode));}

80

Interpreting (2*((3*4)+9))

P

*oper

typeleft right

P

+

P

*

D

2

D

9

D

4

D

3

81

Summary: Journey inside a compiler

LexicalAnalysi

s

Syntax Analysi

s

Sem.Analysi

s

Inter.Rep.

Code Gen.

x = b*b – 4*a*c

txt

<ID,”x”> <EQ> <ID,”b”> <MULT> <ID,”b”> <MINUS> <INT,4> <MULT> <ID,”a”> <MULT> <ID,”c”>

TokenStream

82LexicalAnalysi

s

Syntax Analysi

s

Sem.Analysi

s

Inter.Rep.

Code Gen.

<ID,”x”> <EQ> <ID,”b”> <MULT> <ID,”b”> <MINUS> <INT,4> <MULT> <ID,”a”> <MULT> <ID,”c”>

‘b’ ‘4’

‘b’‘a’

‘c’

ID

ID

ID

ID

ID

factor

term factorMULT

term

expression

expression

factor

term factorMULT

term

expression

term

MULT factor

MINUS

SyntaxTree

Summary: Journey inside a compiler

Statement

‘x’

ID EQ

83LexicalAnalysi

s

Syntax Analysi

s

Sem.Analysi

s

Inter.Rep.

Code Gen.

‘b’ ‘4’

‘b’‘a’

‘c’

ID

ID

ID

ID

ID

factor

term factorMULT

term

expression

expression

factor

term factorMULT

term

expression

term

MULT factor

MINUS

SyntaxTree

Summary: Journey inside a compiler<ID,”x”> <EQ> <ID,”b”> <MULT> <ID,”b”> <MINUS> <INT,4> <MULT> <ID,”a”> <MULT> <ID,”c”>

84Sem.Analysi

s

Inter.Rep.

Code Gen.

‘b’

‘4’

‘b’

‘a’

‘c’

MULT

MULT

MULT

MINUS

LexicalAnalysi

s

Syntax Analysi

s

AbstractSyntaxTree

Summary: Journey inside a compiler

85LexicalAnalysi

s

Syntax Analysi

s

Sem.Analysi

s

Inter.Rep.

Code Gen.

‘b’

‘4’

‘b’

‘a’

‘c’

MULT

MULT

MULT

MINUS

type: intloc: sp+8

type: intloc: const

type: intloc: sp+16

type: intloc: sp+16

type: intloc: sp+24

type: intloc: R2

type: intloc: R2

type: intloc: R1

type: intloc: R1

AnnotatedAbstractSyntaxTree

Summary: Journey inside a compiler

86

Journey inside a compiler

LexicalAnalysi

s

Syntax Analysi

s

Sem.Analysi

s

Inter.Rep.

Code Gen.

‘b’

‘4’

‘b’

‘a’

‘c’

MULT

MULT

MULT

MINUS

type: intloc: sp+8

type: intloc: const

type: intloc: sp+16

type: intloc: sp+16

type: intloc: sp+24

type: intloc: R2

type: intloc: R2

type: intloc: R1

type: intloc: R1

R2 = 4*aR1=b*bR2= R2*cR1=R1-R2

IntermediateRepresentation

87

Journey inside a compiler

Inter.Rep.

Code Gen.

‘b’

‘4’

‘b’

‘a’

‘c’

MULT

MULT

MULT

MINUS

type: intloc: sp+8

type: intloc: const

type: intloc: sp+16

type: intloc: sp+16

type: intloc: sp+24

type: intloc: R2

type: intloc: R2

type: intloc: R1

type: intloc: R1

R2 = 4*aR1=b*bR2= R2*cR1=R1-R2

MOV R2,(sp+8)SAL R2,2MOV R1,(sp+16)MUL R1,(sp+16)MUL R2,(sp+24)SUB R1,R2

LexicalAnalysi

s

Syntax Analysi

s

Sem.Analysi

s

IntermediateRepresentation

AssemblyCode

88

Error Checking

• In every stage…

• Lexical analysis: illegal tokens• Syntax analysis: illegal syntax • Semantic analysis: incompatible types, undefined

variables, …

• Every phase tries to recover and proceed with compilation (why?)– Divergence is a challenge

89

The Real Anatomy of a Compiler

Executable

code

exe

Source

text

txtLexicalAnalysi

s

Sem.Analysis

Process text input

characters SyntaxAnalysi

s

tokens AST

Intermediate code

generation

Annotated AST

Intermediate code

optimization

IR CodegenerationIR

Target code optimizatio

n

Symbolic Instructions

SI Machine code

generation

Write executable

output

MI

90

Optimizations• “Optimal code” is out of reach

– many problems are undecidable or too expensive (NP-complete)– Use approximation and/or heuristics

• Loop optimizations: hoisting, unrolling, … • Peephole optimizations• Constant propagation

– Leverage compile-time information to save work at runtime (pre-computation)

• Dead code elimination– space

• …

91

Machine code generation

• Register allocation– Optimal register assignment is NP-Complete– In practice, known heuristics perform well

• assign variables to memory locations• Instruction selection

– Convert IR to actual machine instructions

• Modern architectures– Multicores– Challenging memory hierarchies

92

And on a More General Note

93

Course Goals

• What is a compiler• How does it work• (Reusable) techniques & tools

• Programming language implementation– runtime systems

• Execution environments– Assembly, linkers, loaders, OS

94

To Compilers, and Beyond …

• Compiler construction is successful– Clear problem – Proper structure of the solution– Judicious use of formalisms

• Wider application– Many conversions can be viewed as

compilation• Useful algorithms

95

Conceptual Structure of a Compiler

Executable

code

exe

Source

text

txt

Semantic

Representation

Backend

(synthesis)

Compiler

Frontend

(analysis)

96

Conceptual Structure of a Compiler

Executable

code

exe

Source

text

txt

Semantic

Representation

Backend

(synthesis)

Compiler

Frontend

(analysis)

LexicalAnalysi

s

Syntax Analysi

s

Parsing

Semantic

Analysis

IntermediateRepresentati

on

(IR)

Code

Generation

97

Judicious use of formalisms

• Regular expressions (lexical analysis)• Context-free grammars (syntactic analysis)• Attribute grammars (context analysis)• Code generator generators (dynamic programming)

• But also some nitty-gritty programming

LexicalAnalysi

s

Syntax Analysi

s

Parsing

Semantic

Analysis

IntermediateRepresentati

on

(IR)

Code

Generation

98

Use of program-generating tools

• Parts of the compiler are automatically generated from specification

Stream of tokens

Jlex

regular expressions

input program scanner

99

Use of program-generating tools

• Parts of the compiler are automatically generated from specification

Jcup

Context free grammar

Stream of tokens parser Syntax tree

100

Use of program-generating tools• Simpler compiler construction

– Less error prone– More flexible

• Use of pre-canned tailored code– Use of dirty program tricks

• Reuse of specification

toolspecification

input (generated) code

output

101

Compiler Construction Toolset

• Lexical analysis generators– Lex, JLex

• Parser generators– Yacc, Jcup

• Syntax-directed translators• Dataflow analysis engines

102

Wide applicability

• Structured data can be expressed using context free grammars– HTML files– Postscript– Tex/dvi files– …

103

Generally useful algorithms

• Parser generators• Garbage collection• Dynamic programming• Graph coloring

104

How to write a compiler?

105

How to write a compiler?

L1 CompilerExecutable compiler

exe

L2 Compiler source

txtL1

106

How to write a compiler?

L1 CompilerExecutable compiler

exe

L2 Compiler source

txtL1

L2 CompilerExecutable program

exe

Program source

txtL2

=

107

How to write a compiler?

L1 CompilerExecutable compiler

exe

L2 Compiler source

txtL1

L2 CompilerExecutable program

exe

Program source

txtL2

=

107

ProgramOutput

Y

Input

X

=

108

Bootstrapping a compiler

109

Bootstrapping a compiler

L1 Compilersimple

L2 executable compiler

exeSimple

L2 compiler source

txtL1

L2s CompilerInefficient adv.

L2 executable compiler

exeadvanced

L2 compiler source

txtL2

L2 CompilerEfficient adv.

L2 executable compiler

Yadvanced

L2 compiler source

X

=

=

110

Proper Design

• Simplify the compilation phase– Portability of the compiler frontend– Reusability of the compiler backend

• Professional compilers are integrated

Java

C

Pascal

C++

ML

Pentium

MIPS

Sparc

Java

C

Pascal

C++

ML

Pentium

MIPS

Sparc

IR

111

Modularity

SourceLanguage 1

txt

SemanticRepresentation

Backend

TL2

Frontend

SL2

int a, b;a = 2;b = a*2 + 1;

MOV R1,2SAL R1INC R1MOV R2,R1

Frontend

SL3

Frontend

SL1

Backend

TL1

Backend

TL3

SourceLanguage 1

txt

SourceLanguage 1

txt

Executabletarget 1

exe

Executabletarget 1

exe

Executabletarget 1

exe

SET R1,2STORE #0,R1SHIFT R1,1STORE #1,R1ADD R1,1STORE #2,R1

112

113

Lexical Analysis

Modern Compiler Design: Chapter 2.1

114

Conceptual Structure of a Compiler

Executable

code

exe

Source

text

txt

Semantic

Representation

Backend

Compiler

Frontend

LexicalAnalysi

s

Syntax Analysi

s

Parsing

Semantic

Analysis

IntermediateRepresentati

on

(IR)

Code

Generation

115

Conceptual Structure of a Compiler

Executable

code

exe

Source

text

txt

Semantic

Representation

Backend

Compiler

Frontend

LexicalAnalysi

s

Syntax Analysi

s

Parsing

Semantic

Analysis

IntermediateRepresentati

on

(IR)

Code

Generation

words sentences

116

What does Lexical Analysis do?

• Language: fully parenthesized expressionsExpr Num | LP Expr Op Expr RPNum Dig | Dig NumDig ‘0’|‘1’|‘2’|‘3’|‘4’|‘5’|‘6’|‘7’|‘8’|‘9’LP ‘(’RP ‘)’Op ‘+’ | ‘*’

( ( 23 + 7 ) * 19 )

117

What does Lexical Analysis do?

• Language: fully parenthesized expressionsContext free

language

Regularlanguages

( ( 23 + 7 ) * 19 )

Expr Num | LP Expr Op Expr RPNum Dig | Dig NumDig ‘0’|‘1’|‘2’|‘3’|‘4’|‘5’|‘6’|‘7’|‘8’|‘9’LP ‘(’RP ‘)’Op ‘+’ | ‘*’

118

What does Lexical Analysis do?

• Language: fully parenthesized expressionsContext free

language

Regularlanguages

( ( 23 + 7 ) * 19 )

Expr Num | LP Expr Op Expr RPNum Dig | Dig NumDig ‘0’|‘1’|‘2’|‘3’|‘4’|‘5’|‘6’|‘7’|‘8’|‘9’LP ‘(’RP ‘)’Op ‘+’ | ‘*’

119

What does Lexical Analysis do?

• Language: fully parenthesized expressionsContext free

language

Regularlanguages

( ( 23 + 7 ) * 19 )

Expr Num | LP Expr Op Expr RPNum Dig | Dig NumDig ‘0’|‘1’|‘2’|‘3’|‘4’|‘5’|‘6’|‘7’|‘8’|‘9’LP ‘(’RP ‘)’Op ‘+’ | ‘*’

120

What does Lexical Analysis do?

• Language: fully parenthesized expressionsContext free

language

Regularlanguages

( ( 23 + 7 ) * 19 )

LP LP Num Op Num RP Op Num RP

Expr Num | LP Expr Op Expr RPNum Dig | Dig NumDig ‘0’|‘1’|‘2’|‘3’|‘4’|‘5’|‘6’|‘7’|‘8’|‘9’LP ‘(’RP ‘)’Op ‘+’ | ‘*’

121

What does Lexical Analysis do?

• Language: fully parenthesized expressionsContext free

language

Regularlanguages

( ( 23 + 7 ) * 19 )

LP LP Num Op Num RP Op Num RPKind

Value

Expr Num | LP Expr Op Expr RPNum Dig | Dig NumDig ‘0’|‘1’|‘2’|‘3’|‘4’|‘5’|‘6’|‘7’|‘8’|‘9’LP ‘(’RP ‘)’Op ‘+’ | ‘*’

122

What does Lexical Analysis do?

• Language: fully parenthesized expressionsContext free

language

Regularlanguages

( ( 23 + 7 ) * 19 )

LP LP Num Op Num RP Op Num RPKind

Value

Expr Num | LP Expr Op Expr RPNum Dig | Dig NumDig ‘0’|‘1’|‘2’|‘3’|‘4’|‘5’|‘6’|‘7’|‘8’|‘9’LP ‘(’RP ‘)’Op ‘+’ | ‘*’

Token Token …

123

• Partitions the input into stream of tokens– Numbers– Identifiers– Keywords– Punctuation

• Usually represented as (kind, value) pairs– (Num, 23)– (Op, ‘*’)

• “word” in the source language• “meaningful” to the syntactical analysis

What does Lexical Analysis do?

124

From scanning to parsing((23 + 7) * x)

) ? * ) 7 + 23 ( (

RP Id OP RP Num OP Num LP LP

Lexical Analyzer

program text

token stream

ParserGrammar: Expr ... | Id Id ‘a’ | ... | ‘z’

Op(*)

Id(?)

Num(23) Num(7)

Op(+)

Abstract Syntax Tree

validsyntaxerror

125

Why Lexical Analysis?

• Well, not strictly necessary, but …– Regular languages Context-Free languages

• Simplifies the syntax analysis (parsing) – And language definition

• Modularity• Reusability • Efficiency

126

Lecture goals

• Understand role & place of lexical analysis

• Lexical analysis theory• Using program generating tools

127

Lecture Outline

Role & place of lexical analysis• What is a token?• Regular languages• Lexical analysis• Error handling• Automatic creation of lexical analyzers

128

What is a token? (Intuitively)

• A “word” in the source language– Anything that should appear in the input to

syntax analysis• Identifiers• Values• Language keywords

• Usually, represented as a pair of (kind, value)

129

Example TokensType Examples

ID foo, n_14, lastNUM 73, 00, 517, 082 REAL 66.1, .5, 5.5e-10IF ifCOMMA ,NOTEQ !=LPAREN (RPAREN )

130

Example Non TokensType Examples

comment /* ignored */preprocessor directive #include <foo.h>

#define NUMS 5.6macro NUMSwhitespace \t, \n, \b, ‘ ‘

131

Some basic terminology

• Lexeme (aka symbol) - a series of letters separated from the rest of the program according to a convention (space, semi-column, comma, etc.)

• Pattern - a rule specifying a set of strings.Example: “an identifier is a string that starts with a letter and continues with letters and digits”– (Usually) a regular expression

• Token - a pair of (pattern, attributes)

132

Examplevoid match0(char *s) /* find a zero */

{

if (!strncmp(s, “0.0”, 3))

return 0. ;

}

VOID ID(match0) LPAREN CHAR DEREF ID(s) RPAREN

LBRACE

IF LPAREN NOT ID(strncmp) LPAREN ID(s) COMMA STRING(0.0) COMMA NUM(3) RPAREN RPAREN

RETURN REAL(0.0) SEMI

RBRACE

EOF

133

Example Non TokensType Examples

comment /* ignored */preprocessor directive #include <foo.h>

#define NUMS 5.6macro NUMSwhitespace \t, \n, \b, ‘ ‘

• Lexemes that are recognized but get consumed rather than transmitted to parser– If– i/*comment*/f

134

How can we define tokens?

• Keywords – easy!– if, then, else, for, while, …

• Identifiers? • Numerical Values?• Strings?

• Characterize unbounded sets of values using a bounded description?

135

Lecture Outline

Role & place of lexical analysisWhat is a token?• Regular languages• Lexical analysis• Error handling• Automatic creation of lexical analyzers

136

Regular languages

• Formal languages– Σ = finite set of letters– Word = sequence of letter– Language = set of words

• Regular languages defined equivalently by– Regular expressions– Finite-state automata

137

Common format for reg-expsBasic Patterns Matching

x The character x

. Any character, usually except a new line

[xyz] Any of the characters x,y,z

^x Any character except x

Repetition Operators

R? An R or nothing (=optionally an R)

R* Zero or more occurrences of R

R+ One or more occurrences of R

Composition Operators

R1R2 An R1 followed by R2

R1|R2 Either an R1 or R2

Grouping

(R) R itself

138

Examples

• ab*|cd? = • (a|b)* =• (0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9)* =

139

Escape characters

• What is the expression for one or more + symbols?– (+)+ won’t work– (\+)+ will

• backslash \ before an operator turns it to standard character– \*, \?, \+, a\(b\+\*, (a\(b\+\*)+, …

• backslash double quotes surrounds text– “a(b+*”, “a(b+*”+

140

Shorthands

• Use names for expressions– letter = a | b | … | z | A | B | … | Z– letter_ = letter | _– digit = 0 | 1 | 2 | … | 9– id = letter_ (letter_ | digit)*

• Use hyphen to denote a range– letter = a-z | A-Z– digit = 0-9

141

Examples

• if = if• then = then• relop = < | > | <= | >= | = | <>

• digit = 0-9• digits = digit+

142

Example

• A number is

number = ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )+

( | . ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )+

( | E ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )+

) )

• Using shorthands it can be written as

number = digits ( | .digits ( | E (|+|-) digits ) )

143

Exercise 1 - Question

• Language of rational numbers in decimal representation (no leading, ending zeros)– 0– 123.757– .933333– Not 007– Not 0.30

144

Exercise 1 - Answer

• Language of rational numbers in decimal representation (no leading, ending zeros)

– Digit = 1|2|…|9Digit0 = 0|DigitNum = Digit Digit0*Frac = Digit0* Digit Pos = Num | .Frac | 0.Frac| Num.FracPosOrNeg = (Є|-)PosR = 0 | PosOrNeg

145

Exercise 2 - Question

• Equal number of opening and closing parenthesis: [n]n = [], [[]], [[[]]], …

146

Exercise 2 - Answer

• Equal number of opening and closing parenthesis: [n]n = [], [[]], [[[]]], …

• Not regular• Context-free• Grammar: S ::= [] | [S]

147

Challenge: Ambiguity

• if = if• id = letter_ (letter_ | digit)*

• “if” is a valid word in the language of identifiers… so what should it be?

• How about the identifier “iffy”?

• Solution– Always find longest matching token– Break ties using order of definitions… first definition

wins (=> list rules for keywords before identifiers)

148

Creating a lexical analyzer

• Given a list of token definitions (pattern name, regex), write a program such that– Input: String to be analyzed– Output: List of tokens

• How do we build an analyzer?

149