Foundations of Software Design

Lecture 24: Compilers, Lexers, and Parsers; Intro to Graphs

Marti Hearst, Fall 2002

How Do Computers Work (Revisited)?

• Bits & Bytes
• Binary Numbers
• Number Systems
• Orders of Magnitude
• Gates
• Boolean Logic
• Circuits
• CPU
• Machine Instructions
• Assembly Language
• Programming Languages
• Address Space
• Code vs. Data
• Compiler

Adapted from lecture by Ras Bodik, http://www.cs.wisc.edu/~bodik/cs536.html

The Compiler

• What is a compiler?
  – A recognizer (of some source language L).
  – A translator (of programs written in L into programs written in some object or target language L').
• A compiler is itself a program, written in some host language.
• It operates in phases.

[Diagram: Machine Instructions, Assembly Language, Programming Languages, Compiler]

Converting Java to Byte Code

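• When you compile a Java program, javac produces byte codes (stored in the class file).
• javac does not convert the byte codes to machine code.
• Instead, they are interpreted in the VM when you run the program called java. (A JIT compiler inside the VM may translate frequently executed byte codes to machine code at run time.)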
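To make "interpreted in the VM" concrete, here is a minimal sketch of a stack-based interpreter loop. The class name ToyVm and the opcodes PUSH, ADD, MUL, and HALT are invented for illustration; they are not real JVM byte codes, and the real VM does far more (class loading, verification, garbage collection).

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Toy stack machine; the opcodes are hypothetical, not real JVM byte codes. */
final class ToyVm {
    static final int PUSH = 0, ADD = 1, MUL = 2, HALT = 3;

    /** Interprets the program one instruction at a time and returns the top of the stack. */
    static int run(int[] code) {
        Deque<Integer> stack = new ArrayDeque<>();
        int pc = 0; // program counter: index of the next instruction
        while (true) {
            int op = code[pc++];
            switch (op) {
                case PUSH: stack.push(code[pc++]); break; // the operand follows the opcode
                case ADD:  stack.push(stack.pop() + stack.pop()); break;
                case MUL:  stack.push(stack.pop() * stack.pop()); break;
                case HALT: return stack.pop();
                default:   throw new IllegalStateException("bad opcode " + op);
            }
        }
    }

    public static void main(String[] args) {
        // Encodes 2 + 3 * 4 in postfix order.
        int[] program = {PUSH, 2, PUSH, 3, PUSH, 4, MUL, ADD, HALT};
        System.out.println(run(program)); // prints 14
    }
}
```

The loop never turns the program into machine code; it just looks at each opcode and acts on it, which is the sense in which byte codes are "interpreted".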

[Diagram labels]

• C code: translated by the C compiler (gcc or cc) into Assembly Language, then Machine Code
• Java code: translated by the Java compiler (javac or jit) into Byte code (class file)
• Java Virtual Machine: the JVM is created once; each individual program is loaded & run in the JVM

Compiler Compilers

• Which came first: the compiler or the program?
  – The very first one has to be written in assembly language!
  – This is why most programming languages today start with the C code generator
• After you have created the first compiler for a given language, say Java, then you …
• Use that compiler to compile itself!!

Compiling Your Compiler

1. Write the first Java compiler using C (javac in C); compile it using gcc.
2. Write the second Java compiler using Java (javac in Java); compile it using javac.
3. Write other Java programs; compile them using javac.

Compiler in more detail

1. Lexical analyzer (scanner)
2. Syntax analyzer (parser)
3. Semantic analyzer
4. Intermediate Code Generator
5. Optimizer
6. Code Generator

The Scanner

• Task:
  – Translate the sequence of characters into a corresponding sequence of tokens (by grouping characters into lexemes).
• How it’s done
  – Specify lexemes using Regular Expressions
  – Convert these Regular Expressions into Finite Automata

Lexemes and Tokens

Here are some Java lexemes and the corresponding tokens:

Lexeme   Token
;        SEMI-COLON
=        ASSIGN
index    IDENT
tmp      IDENT
37       INT-LIT
102      INT-LIT

Note that multiple lexemes can correspond to the same token (e.g., there are many identifiers).

Given the source code: position = initial + rate * 60 ;

a Java scanner would return the following sequence of tokens:

IDENT ASSIGN IDENT PLUS IDENT TIMES INT-LIT SEMI-COLON
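A minimal sketch of such a scanner, using java.util.regex in place of a generated automaton. The class name TinyScanner and the regular expressions are assumptions made for this example; they cover only the lexemes in the source line above, not the full Java lexical grammar.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Regex-driven scanner sketch; token names follow the slide, everything else is illustrative. */
final class TinyScanner {
    // One named group per kind of lexeme; whitespace (WS) is matched but not reported.
    private static final Pattern TOKEN = Pattern.compile(
        "(?<WS>\\s+)|(?<IDENT>[A-Za-z_][A-Za-z_0-9]*)|(?<INTLIT>[0-9]+)"
        + "|(?<ASSIGN>=)|(?<PLUS>\\+)|(?<TIMES>\\*)|(?<SEMI>;)");

    // Maps each regex group name to the token name used on the slide.
    private static final String[][] KINDS = {
        {"IDENT", "IDENT"}, {"INTLIT", "INT-LIT"}, {"ASSIGN", "ASSIGN"},
        {"PLUS", "PLUS"}, {"TIMES", "TIMES"}, {"SEMI", "SEMI-COLON"},
    };

    /** Groups characters into lexemes and returns the corresponding token names. */
    static List<String> scan(String source) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(source);
        int pos = 0;
        while (pos < source.length()) {
            m.region(pos, source.length());
            if (!m.lookingAt()) { // no lexeme starts here: a lexical error
                throw new IllegalArgumentException("lexical error at offset " + pos);
            }
            for (String[] kind : KINDS) {
                if (m.group(kind[0]) != null) tokens.add(kind[1]);
            }
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(scan("position = initial + rate * 60 ;"));
    }
}
```

A character that starts no lexeme (for example `#`) makes scan throw, which corresponds to the scanner reporting a lexical error.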


The Scanner

• Also called the Lexer
• How it works:
  – Reads characters from the source program.
  – Groups the characters into lexemes (sequences of characters that "go together").
  – Each lexeme corresponds to a token;
    • the scanner returns the next token (plus maybe some additional information) to the parser.
  – The scanner may also discover lexical errors (e.g., erroneous characters).
• The definitions of what is a lexeme, token, or bad character all depend on the source language.

Two kinds of Automata

Deterministic (DFA):
  – No state has more than one outgoing edge with the same label.

Non-Deterministic (NFA):
  – States may have more than one outgoing edge with the same label.
  – Edges may be labeled with ε (epsilon), the empty string.
  – The automaton can take an epsilon transition without looking at the current input character.
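As a sketch of how an NFA with an epsilon edge can be run, the code below tracks the set of states the automaton could be in. The automaton itself is made up for this example: it accepts an optionally signed integer, with edges 0 –(+ or -)→ 1, 0 –ε→ 1, 1 –digit→ 2, and 2 –digit→ 2, where state 2 is accepting.

```java
import java.util.HashSet;
import java.util.Set;

/** Simulates a small, hypothetical NFA for an optionally signed integer. */
final class SignedIntNfa {
    /** Adds every state reachable through epsilon edges alone. */
    static Set<Integer> closure(Set<Integer> states) {
        Set<Integer> result = new HashSet<>(states);
        if (result.contains(0)) result.add(1); // the only epsilon edge is 0 -> 1
        return result;
    }

    /** States reachable from state s by consuming the input character c. */
    static Set<Integer> step(int s, char c) {
        Set<Integer> next = new HashSet<>();
        boolean digit = c >= '0' && c <= '9';
        if (s == 0 && (c == '+' || c == '-')) next.add(1);
        if ((s == 1 || s == 2) && digit) next.add(2);
        return next;
    }

    /** Accepts if any possible path through the NFA ends in the accepting state 2. */
    static boolean accepts(String input) {
        Set<Integer> current = new HashSet<>();
        current.add(0);
        current = closure(current);
        for (char c : input.toCharArray()) {
            Set<Integer> next = new HashSet<>();
            for (int s : current) next.addAll(step(s, c));
            current = closure(next);
        }
        return current.contains(2);
    }

    public static void main(String[] args) {
        System.out.println(accepts("-42")); // prints true
        System.out.println(accepts("+"));   // prints false
    }
}
```

Each distinct set of NFA states that can arise here plays the role of a single DFA state, which is the idea behind converting an NFA into a DFA.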


Regular Expressions to Finite Automata

• Generating a scanner:
  Lexical Specification → Regular expressions → NFA → DFA → Table-driven Implementation of DFA
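A minimal sketch of a table-driven DFA, assuming a simplified identifier rule (a letter or underscore followed by letters, digits, or underscores). The state numbering, the three character classes, and the class name IdentDfa are choices made for this example, not the output of any particular scanner generator.

```java
/** Table-driven DFA sketch for a simplified identifier. */
final class IdentDfa {
    // Rows are states, columns are character classes: 0 = letter or '_', 1 = digit, 2 = other.
    private static final int[][] NEXT = {
        {1, 2, 2}, // state 0: start
        {1, 1, 2}, // state 1: inside an identifier (accepting)
        {2, 2, 2}, // state 2: dead state, no way back
    };

    private static int charClass(char c) {
        if (Character.isLetter(c) || c == '_') return 0;
        if (Character.isDigit(c)) return 1;
        return 2;
    }

    /** Runs the DFA: exactly one table lookup per input character. */
    static boolean accepts(String s) {
        int state = 0;
        for (int i = 0; i < s.length(); i++) {
            state = NEXT[state][charClass(s.charAt(i))];
        }
        return state == 1;
    }

    public static void main(String[] args) {
        System.out.println(accepts("index2")); // prints true
        System.out.println(accepts("2index")); // prints false
    }
}
```

Because the automaton is deterministic, the driver loop never has to track more than one current state; all of the language-specific knowledge lives in the table.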

BNF

• Backus-Naur form, Backus-Normal form
  – A set of rules (or productions)
  – Each of which expresses the ways symbols of the language can be grouped together
• Non-terminals are written upper-case
• Terminals are written lower-case
• The start symbol is the left-hand side of the first production
• The rules for a CFG are often referred to as its BNF

Java Identifier Definition

Described in the Java specification:
– http://java.sun.com/docs/books/jls/second_edition/html/lexical.doc.html#44591

– “An identifier is an unlimited-length sequence of Java letters and Java digits, the first of which must be a Java letter.

– An identifier cannot have the same spelling (Unicode character sequence) as a keyword (§3.9), Boolean literal (§3.10.3), or the null literal (§3.10.7).”
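The quoted rule can be sketched with two methods from java.lang.Character, isJavaIdentifierStart and isJavaIdentifierPart. The class name IdentifierCheck is made up, and the RESERVED set is a deliberately partial list for illustration; the specification lists every keyword. The sketch also works on char values only, so it ignores supplementary Unicode characters.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Checks the identifier rule quoted above against a partial reserved-word list. */
final class IdentifierCheck {
    // Partial list for illustration only: a few keywords plus the Boolean and null literals.
    private static final Set<String> RESERVED = new HashSet<>(Arrays.asList(
        "class", "int", "for", "while", "if", "else", "return", "true", "false", "null"));

    static boolean isIdentifier(String s) {
        if (s.isEmpty() || RESERVED.contains(s)) return false;
        // The first character must be a "Java letter".
        if (!Character.isJavaIdentifierStart(s.charAt(0))) return false;
        // The remaining characters may be Java letters or Java digits.
        for (int i = 1; i < s.length(); i++) {
            if (!Character.isJavaIdentifierPart(s.charAt(i))) return false;
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(isIdentifier("tmp"));  // prints true
        System.out.println(isIdentifier("null")); // prints false
    }
}
```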


Java Integer Literals

• An integer literal may be expressed in decimal (base 10), hexadecimal (base 16), or octal (base 8)

• Examples: 0, 2, 0372, 0xDadaCafe, 1996, 0x00FF00FF

(opt means optional)

Defining Java Decimal Numerals

A decimal numeral is either the single ASCII character 0, representing the integer zero, or consists of an ASCII digit from 1 to 9, optionally followed by one or more ASCII digits from 0 to 9, representing a positive integer:
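The definition above translates almost word for word into the regular expression `0|[1-9][0-9]*`. The sketch below adds simplified patterns for hexadecimal and octal numerals so the earlier examples can be classified; the class name IntLiterals is made up, and the optional l/L type suffix is left out on purpose.

```java
import java.util.regex.Pattern;

/** Classifies integer literals with simplified patterns (no l/L suffix). */
final class IntLiterals {
    // "0", or a digit 1-9 optionally followed by more digits 0-9.
    static final Pattern DECIMAL = Pattern.compile("0|[1-9][0-9]*");
    static final Pattern HEX = Pattern.compile("0[xX][0-9a-fA-F]+");
    static final Pattern OCTAL = Pattern.compile("0[0-7]+");

    static String classify(String s) {
        if (DECIMAL.matcher(s).matches()) return "decimal";
        if (HEX.matcher(s).matches()) return "hexadecimal";
        if (OCTAL.matcher(s).matches()) return "octal";
        return "not an integer literal";
    }

    public static void main(String[] args) {
        for (String s : new String[] {"0", "2", "0372", "0xDadaCafe", "1996", "0x00FF00FF", "09"}) {
            System.out.println(s + ": " + classify(s));
        }
    }
}
```

Note that 09 is rejected: a leading 0 followed by more digits is read as octal, and 9 is not an octal digit.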

Defining Floating-Point Literals

A floating-point literal has the following parts: a whole-number part, a decimal point (represented by an ASCII period character), a fractional part, an exponent, and a type suffix. The exponent, if present, is indicated by the ASCII letter e or E followed by an optionally signed integer.

From the Lucene HTML Scanner


The Functionality of the Parser

• Input: sequence of tokens from lexical analysis
• Output: parse tree of the program
  – parse tree is generated if the input is a legal program
  – if input is an illegal program, syntax errors are issued
• Note:
  – Instead of parse tree, some parsers produce directly:
    • abstract syntax tree (AST) + symbol table, or
    • intermediate code, or
    • object code

Parser vs. Scanner

Phase     Input                  Output
Scanner   String of characters   String of tokens
Parser    String of tokens       Parse tree

The Parser

• Groups tokens into "grammatical phrases", discovering the underlying structure of the source program.
• Finds syntax errors.
  – Example
    • position = * 5 ;
  – corresponds to the sequence of tokens: IDENT ASSIGN TIMES INT-LIT SEMI-COLON
  – All are legal tokens, but that sequence of tokens is erroneous.
• Might find some "static semantic" errors, e.g., a use of an undeclared variable, or variables that are multiply declared.
• Might generate code, or build some intermediate representation of the program such as an abstract-syntax tree.

What must the parser do?

1. Recognizer: not all strings of tokens are programs
   – must distinguish between valid and invalid strings of tokens
2. Translator: must expose program structure
   • e.g., associativity and precedence
   • must return the parse tree

We need:
  – A language for describing valid strings of tokens
    • context-free grammars
    • (analogous to regular expressions in the scanner)
  – A method for distinguishing valid from invalid strings of tokens (and for building the parse tree)
    • the parser
    • (analogous to the state machine in the scanner)

Parser Example

position = initial + rate * 60 ;

=
├── position
└── +
    ├── initial
    └── *
        ├── rate
        └── 60
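A minimal recursive-descent sketch that builds this tree for the statement above. The grammar in the comment, the class name TinyParser, and the prefix-notation output are assumptions made for this example. To stay short, the sketch splits the input on whitespace instead of calling a real scanner.

```java
/**
 * Recursive-descent parser sketch for a hypothetical grammar
 * (upper-case = non-terminal, lower-case = terminal, { } = zero or more):
 *   STMT   -> ident = EXPR ;
 *   EXPR   -> TERM { + TERM }
 *   TERM   -> FACTOR { * FACTOR }
 *   FACTOR -> ident | int-lit
 */
final class TinyParser {
    private final String[] tokens;
    private int pos = 0;

    private TinyParser(String source) {
        // Stand-in for a scanner: lexemes must be separated by whitespace.
        this.tokens = source.trim().split("\\s+");
    }

    /** Returns the tree in prefix form, or throws on a syntax error. */
    static String parse(String source) {
        TinyParser p = new TinyParser(source);
        String target = p.factor(true);
        p.expect("=");
        String tree = "(= " + target + " " + p.expr() + ")";
        p.expect(";");
        if (p.pos != p.tokens.length) throw new IllegalArgumentException("trailing input");
        return tree;
    }

    private String expr() { // EXPR -> TERM { + TERM }
        String left = term();
        while (peek().equals("+")) { pos++; left = "(+ " + left + " " + term() + ")"; }
        return left;
    }

    private String term() { // TERM -> FACTOR { * FACTOR }; binds tighter than +
        String left = factor(false);
        while (peek().equals("*")) { pos++; left = "(* " + left + " " + factor(false) + ")"; }
        return left;
    }

    private String factor(boolean identOnly) { // FACTOR -> ident | int-lit
        String t = peek();
        String shape = identOnly ? "[A-Za-z_]\\w*" : "[A-Za-z_]\\w*|[0-9]+";
        if (!t.matches(shape)) throw new IllegalArgumentException("syntax error at '" + t + "'");
        pos++;
        return t;
    }

    private String peek() { return pos < tokens.length ? tokens[pos] : "<eof>"; }

    private void expect(String lexeme) {
        if (!peek().equals(lexeme)) {
            throw new IllegalArgumentException("expected '" + lexeme + "' but found '" + peek() + "'");
        }
        pos++;
    }

    public static void main(String[] args) {
        System.out.println(parse("position = initial + rate * 60 ;"));
    }
}
```

Precedence comes from the shape of the grammar: TERM handles * below EXPR, so rate * 60 is grouped before the addition. The erroneous input position = * 5 ; is rejected in factor, because * can start neither an identifier nor an integer literal.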

The Semantic Analyzer

• The semantic analyzer checks for (more) "static semantic" errors, e.g., type errors.
• Annotates and/or changes the abstract syntax tree
  – (e.g., it might annotate each node that represents an expression with its type).
  – Example with before and after:

Before:

=
├── position
└── +
    ├── initial
    └── *
        ├── rate
        └── 60

After:

= (float)
├── position (float)
└── + (float)
    ├── initial (float)
    └── * (float)
        ├── rate (float)
        └── int-to-float() (float)
            └── 60 (int)

Intermediate Code Generator

The intermediate code generator translates from abstract-syntax tree to intermediate code.

– One possibility is 3-address code.
– Here's an example of 3-address code for the abstract-syntax tree shown above:

    temp1 = int-to-float(60)
    temp2 = rate * temp1
    temp3 = initial + temp2
    position = temp3
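One way such code could be produced is a post-order walk over the type-annotated tree that invents a fresh temporary for every operator and every conversion. This is a sketch under assumptions: the Node class, the boolean isFloat annotation (taken to be the result of semantic analysis), and the class name ThreeAddress are all made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Emits 3-address code from a small, already type-annotated expression tree. */
final class ThreeAddress {
    /** A leaf holds a name or literal; an operator node has two children. */
    static final class Node {
        final String label;
        final boolean isFloat; // type annotation, assumed to come from semantic analysis
        final Node left, right;
        Node(String label, boolean isFloat) { this(label, isFloat, null, null); }
        Node(String label, boolean isFloat, Node left, Node right) {
            this.label = label; this.isFloat = isFloat; this.left = left; this.right = right;
        }
    }

    private final List<String> code = new ArrayList<>();
    private int temps = 0;

    private String fresh() { return "temp" + (++temps); }

    /** Emits code for the subtree and returns the name that holds its value. */
    private String emit(Node n, boolean wantFloat) {
        String place;
        if (n.left == null) {
            place = n.label; // leaf: no code needed
        } else {
            String l = emit(n.left, n.isFloat);
            String r = emit(n.right, n.isFloat);
            place = fresh();
            code.add(place + " = " + l + " " + n.label + " " + r);
        }
        if (wantFloat && !n.isFloat) { // int value used where a float is needed
            String converted = fresh();
            code.add(converted + " = int-to-float(" + place + ")");
            place = converted;
        }
        return place;
    }

    static List<String> assign(String target, Node value) {
        ThreeAddress g = new ThreeAddress();
        String place = g.emit(value, value.isFloat);
        g.code.add(target + " = " + place);
        return g.code;
    }

    public static void main(String[] args) {
        Node tree = new Node("+", true, new Node("initial", true),
            new Node("*", true, new Node("rate", true), new Node("60", false)));
        for (String line : assign("position", tree)) System.out.println(line);
    }
}
```

For the tree built in main, the walk produces the same four lines as the example above.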


The Optimizer

• Examine the program and rewrite it in ways that preserve the meaning but are more efficient.
• Incredibly complex programs and algorithms
• Example
  – Move the declaration of temp outside the loop so it isn’t re-declared every time the loop is executed
  – Change 2*5 to 10 since it is a constant (no need to do an expensive multiply at run time)
  – If we removed the line with temp, the program might even skip the loop altogether
    • You can see in advance that count ends up = 30

    int count = 0;
    for (int j = 0; j < 2*5; j++) {
        int temp = j + 1;
        count += 3;
    }
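The rewrites described above can be written out by hand as three versions of the same computation. This is only an illustration of what an optimizer might do; the class and method names are made up, and whether a particular compiler or VM performs each step is not something this sketch claims.

```java
/** Hand-applied versions of the optimizations described on the slide. */
final class LoopOptimization {
    /** The loop as written on the slide. */
    static int original() {
        int count = 0;
        for (int j = 0; j < 2 * 5; j++) {
            int temp = j + 1; // computed but never read
            count += 3;
        }
        return count;
    }

    /** After constant folding (2*5 becomes 10) and removing the unused temp. */
    static int folded() {
        int count = 0;
        for (int j = 0; j < 10; j++) {
            count += 3;
        }
        return count;
    }

    /** After working out the loop ahead of time: 10 iterations times 3. */
    static int precomputed() {
        return 30;
    }

    public static void main(String[] args) {
        System.out.println(original() + " " + folded() + " " + precomputed()); // prints 30 30 30
    }
}
```

All three methods return the same value, which is the requirement on any optimization: the meaning is preserved while the work done at run time shrinks.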


The Code Generator

• The code generator generates object code from (optimized) intermediate code.

    LOADF  rate,R1
    MULF   #60.0,R1
    LOADF  initial,R2
    ADDF   R2,R1
    STOREF R1,position

Tools

• Scanner Generator
  – Used to create a scanner automatically
  – Input:
    • a regular expression for each token to be recognized
  – Output:
    • a finite state machine
  – Examples:
    • lex or flex (produce C code), or JLex (produces Java)
• Compiler Compilers
  – yacc (produces C) or JavaCC (produces Java, also has a scanner generator).

From the Lucene HTML Parser


Graphs / Networks

Slides adapted from Goodrich & Tamassia

What is a Graph?


Next Time

• Graph Traversal
• Directed Graphs (digraphs)
• DAGs
• Weighted Graphs