
1

Syntax

The Grammar of a Language

2

Topics to Know

Difference between syntax and semantics.

Four categories of languages.

Regular expressions
    used for pattern matching and extracting info
    regular expressions are an essential part of many programming languages ... memorize them!

BNF and EBNF
    used to describe grammar of computer languages
    can be used to automatically generate a parser for a language

3

Syntax and Semantics

The syntax of a language defines the valid symbols and grammar.

Syntax defines the structure of a program, i.e., the form that each program unit and each statement must use.

The semantics defines the meaning of the grammar elements.

Lexical structure is the form of lowest level syntactic units (words or tokens) of a grammar.

4

Syntax and Semantics Compared

Syntax: in Java, an assignment statement is:

identifier = expression { operator expression } ;

Semantics: an assignment statement must use compatible types, e.g.

int n1, n2;
n1 = 20*1024; // OK, int_var = int_expression
n2 = 3.50;    // illegal, incompatible types

Lexical elements (tokens):

"n2" "=" "3.50" ";"

5

Syntax and Semantics Compared

Syntax: the form of a while statement is:

while ( boolean_expression ) statement ;

Semantics:

when a thread of execution encounters a while statement the boolean expression is tested. If the expression evaluates to true then the statement is executed and the process is repeated. If [not when] the expression evaluates to false, execution continues to the next statement. ...

6

How are they used?

Parts of a Compiler / Interpreter:

Source Code -> Tokenizer (Lexical Analysis) -> Token stream -> Parser (Syntax Analysis) -> Parse tree -> Semantic Analysis -> Intermediate code -> Optimization and Code Generation -> Object code

7

Scanning and Parsing

source file -> (input stream) -> Tokenizer -> (tokens) -> Parser -> parse tree

input stream: sum = x1 + x2;

tokens: sum = x1 + x2 ;

parse tree: assignment: sum = ( + x1 x2 )

8

Scanners

Recognize regular expressions

Implemented as finite automata (finite state machines)

Typically contain a loop that cycles through characters, building tokens and associated values by repeated operations

The scanner may be integrated as a function in the parser.

Parser calls the Scanner to get the next token.

9

Parsers

Recognize patterns defined by grammar rules

Implemented as pushdown automata

Convert a stream of tokens (supplied by the scanner) into a parse tree containing symbols defined in the grammar.

Symbols are things like "assignment", "expression"

Parsing is more difficult than scanning.

10

Formal Languages

Famed linguist Noam Chomsky introduced a formal classification of (human) languages. In terms of computing theory, his categories are:

Hierarchy   Grammars            Languages                Minimal Automaton
Type 0      unrestricted        Recursively enumerable   Turing machine
                                Recursive                Decider
Type 1      context-sensitive   Context-sensitive        Linear-bounded
Type 2      context-free        Context-free             Pushdown
Type 3      regular             Regular                  Deterministic finite

Each class in this hierarchy is a subset of the class above it.

A context-free grammar is a grammar where the syntax of each constituent is independent of the symbols that come before and after it.

11

Applying Formal Languages (1)

Tokenizer or Scanner (Lexical Analysis):

The lexemes (tokens) in a computer language are a "regular grammar" (Type 3).

Therefore, we can use the simplest grammar processor to make a tokenizer.

Rules for tokens are defined using regular expressions.

Examples:

integer ::= "[+-]?[0-9]+" (actually, this is too simple)

integer ::= "[+-]?(0[0-7]*|[1-9][0-9]*)" (better)

identifier ::= "[A-Za-z_][A-Za-z_0-9]*"
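The token rules above can be tried directly with java.util.regex. The following is a minimal sketch, not part of the original slides; the class name TokenRules and its helper methods are made up for illustration. It shows why the first integer rule is "too simple": it accepts 089, which is neither a valid decimal nor a valid octal constant.

```java
import java.util.regex.Pattern;

// Hypothetical helper class; the three patterns are copied from the rules above.
class TokenRules {
    static final Pattern SIMPLE_INTEGER = Pattern.compile("[+-]?[0-9]+");
    static final Pattern INTEGER = Pattern.compile("[+-]?(0[0-7]*|[1-9][0-9]*)");
    static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_][A-Za-z_0-9]*");

    // matches() succeeds only if the whole string fits the pattern.
    static boolean isInteger(String s) { return INTEGER.matcher(s).matches(); }
    static boolean isIdentifier(String s) { return IDENTIFIER.matcher(s).matches(); }

    public static void main(String[] args) {
        for (String s : new String[] {"1234", "-17", "0377", "089", "x1", "_INIT", "2x"}) {
            System.out.println(s + "  integer=" + isInteger(s)
                    + "  identifier=" + isIdentifier(s));
        }
    }
}
```

Here "089" is rejected by the better rule because a leading 0 may only be followed by octal digits.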

12

Applying Formal Languages (2)

Syntax Analysis

The syntax of a computer language is a "Context-free Grammar" ...almost.

We can use a "type 2" grammar processor as a parser.

Rules are defined in Backus-Naur Form and Extended BNF.

Example:

expression ::= expression + term | expression - term | term

term ::= term * factor | term / factor | factor

factor ::= ( expression ) | NUMBER

13

Lexical Structure

Lexemes are the smallest lexical unit of a language, grouped according to syntactic usage.

Some types of lexemes in computer languages are:

identifiers: x, println, _INIT, ArrayList

numeric constants: 0, 10000, 2.98E+6

operators: =, +, -, ++, +=, *, /

separators: [ ] ; : . , ( )

string literals: "hello there"

A token is a string representing the value of a lexeme.

Lexemes are recognized by the first phase of a translator -- the scanner -- that deals directly with the input. The scanner separates the input into tokens.

Scanners are also called lexers.

14

Tokens

Tokens are the strings of syntactic units.

Example: what are the tokens in this statement?

    result = (sum - average)/count;

Tokens:

    result     identifier
    =          assignment operator
    (          expression delimiter
    sum        identifier
    -          arithmetic operator
    average    identifier
    )          expression delimiter
    /          arithmetic operator
    count      identifier
    ;          semi-colon (statement delimiter)
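A scanner that produces a token list like this can be sketched with one regular expression per token class. This is an illustrative sketch only: the class name SimpleTokenizer, the four token classes, and the "text kind" output format are assumptions made for this example, not the slides' own design.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical tokenizer: one named group per (assumed) token class.
class SimpleTokenizer {
    private static final Pattern TOKEN = Pattern.compile(
          "(?<id>[A-Za-z_][A-Za-z_0-9]*)"
        + "|(?<num>[0-9]+(?:\\.[0-9]+)?)"
        + "|(?<op>[-+*/=])"
        + "|(?<sep>[();])");

    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        while (true) {
            // White space only separates tokens; skip it.
            while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) pos++;
            if (pos == input.length()) break;
            m.region(pos, input.length());
            if (!m.lookingAt()) {   // no token rule matches at this position
                throw new IllegalArgumentException("unexpected character at position " + pos);
            }
            String kind = m.group("id") != null ? "identifier"
                        : m.group("num") != null ? "number"
                        : m.group("op") != null ? "operator"
                        : "separator";
            tokens.add(m.group() + " " + kind);
            pos = m.end();
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String t : tokenize("result = (sum - average)/count;")) {
            System.out.println(t);
        }
    }
}
```

A parser would normally call such a scanner one token at a time; returning the whole list just keeps the sketch short.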

15

C tokens

Lexical structure of C, defined in The C Programming Language...

“There are six classes of tokens: identifiers, keywords, constants, string literals, operators, and other separators. Blanks, horizontal and vertical tabs, newlines, formfeeds, and comments as described below (collectively, "white space") are ignored except as they separate tokens. Some white space is required to separate otherwise adjacent identifiers, keywords, and constants. If the input stream has been separated into tokens up to a given character, the next token is the longest string of characters that could constitute a token.” [Kernighan and Ritchie, The C Programming Language, 2nd Ed., pp. 191-192.]

C uses the principle of longest match (substring).

16

Principle of Longest Match

A token should be the longest string that satisfies a rule for lexemes.

Example: x2+=1.0

tokens: "x2" "+=" "1.0"

NOT: "x" "2" "+" "=" "1" "." "0"

What are the tokens in these inputs?

x = y+1;

x = y+=1; // tokenizer cannot rely on optional spaces

x = y == 1;

if ( x++ = 1 ) getFirstValue( ); // probably an error

The tokenizer should not use context or look-ahead of more than 1 character. (FORTRAN is an exception)
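The longest-match principle can be tried with a small regex-based sketch. Note one assumption specific to this sketch: Java's regex alternation takes the first alternative that matches, not the longest, so the two-character operators are listed before the single-character ones. The class name LongestMatch and the token set are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class for the longest-match rule.
class LongestMatch {
    private static final Pattern TOKEN = Pattern.compile(
          "[A-Za-z_][A-Za-z_0-9]*"     // identifier (greedy, so "x2" is one token)
        + "|[0-9]+(?:\\.[0-9]+)?"      // number such as 1 or 1.0
        + "|\\+\\+|\\+=|=="            // two-character operators come first
        + "|[-+*/=();]");              // then single-character tokens

    static List<String> tokens(String input) {
        List<String> result = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        // find() silently skips anything that is not a token, such as white space.
        while (m.find()) {
            result.add(m.group());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(tokens("x2+=1.0"));
        System.out.println(tokens("x = y+=1;"));
    }
}
```

If the single-character alternatives were listed first, "+=" would be split into "+" and "=", which is exactly the mistake the principle rules out.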

17

Describing Lexical Structure

In C, an identifier (i) begins with a letter or _ (ii) followed by any number of letters, '_', or digits.

Compare these two examples: which way is simpler?

Rules for C identifiers using EBNF Rules:

identifier ::= ( letter | _ ) { letter | _ | digit }

letter ::= 'A' | 'B' | 'C' | 'D' | . . . | 'Z' | 'a' | 'b' | 'c' | 'd' | . . . | 'z'

digit ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'

Rule for C identifiers using Regular Expressions:

identifier ::= [A-Za-z_][A-Za-z_0-9]*

identifier ::= [A-Za-z_]\w*

18

Rules for Regular Expressions

Regular expressions:

x match an occurrence of x

[abcd] match any one of these characters

[A-Z] any one character from this range

[A-Za-z_] any letter or _

x* [a-z]* 0 or more occurrences of these

x+ [a-z]+ 1 or more occurrences of these

x? [a-z]? exactly 0 or 1 occurrence

x{5} match exactly 5 times, same as xxxxx

x{4,8} match between 4 and 8 times

. (period) match any one character

.* match anything!

19

Pattern Matching in Java

java.util.regex.Matcher

matches Strings using regular expressions (patterns)

can be used to extract the thing that is matched

java.util.regex.Pattern

defines objects for regular expressions

String.matches( regex )

tests whether the string value matches a regular expression.

String.split( regex [, maxelements ] )

split a string everyplace the regex is found.

20

Using String matches( )

Java's String matches( ) method and the Matcher class.

"hi".matches("[a-z]+") true

"you2".matches("[a-z]+") false

"9".matches("\\d") true, \d := [0-9]

"4@com".matches(".+@.+") anything @ anything

"away".matches("a.*y") true

"ay".matches("a.*y") true

"ay".matches("a.+y") false, at least 1

"123".matches("-?\\d+") true

"-123".matches("-?\\d+") true

"--12".matches("-?\\d+") false

21

Using String split( )

Split a string into pieces anyplace a white space is found

String s = "split me\tat white space ";

String [ ] word = s.split("\\s+");

word[0]="split"

word[1]="me"

word[2]="at"

word[3]="white"

word[4]="space"

22

Character classes and special chars

\n \t \r \f \e newline, tab, return, formfeed, escape

\x4E character with hexadecimal value 4E

\u1234 character with Unicode value 1234

[^abcd] any character NOT (^) in this set.

^ must be FIRST CHAR after "["

\d any digit, same meaning as [0-9]

\D any non-digit,same as [^0-9]

\s any white space, [ \t\n\x0B\f\r]

\S any non-whitespace

\w any word character, [a-zA-Z0-9_]

\W any non-word character, [^a-zA-Z0-9_]

( ... ) pattern group

23

Examples using Character Classes

Match a valid student ID for this class:

4[6-9]\d{6}

A C identifier: begins with a letter or underscore, followed by any number of letters, digits, or underscore

[A-Za-z_]\w*

Hello in Thai:

\u0E2A\u0E27\u0E31\u0E2A\u0E14\u0E35

24

Positional Matching

^ match beginning of a line

$ match the end of a line

\b match a word boundary

Match a student ID at the beginning of the string, as a word by itself:

Pattern p = Pattern.compile("^\\d{8}\\b.*");
p.matcher("47541234 Joe Hacker").matches(); // match
p.matcher("123456789 too long").matches();  // no match
p.matcher("My ID is 48541234").matches();   // no match

25

Regular Expression for the Time

Write a pattern to match a time string of the form "hh:mm:ss am" or "hh:mm:ss PM"; use a 12 hour clock (am, pm).

Examples: 3:38:09 am, 9:59:59 AM, 12:47:38 PM.

26

Regular Expression for the Time

Write a pattern to match a time string of the form "hh:mm:ss am" or "hh:mm:ss PM"; use a 12 hour clock (am,pm). "am", "pm" can be uppercase or lowercase.

Example: 3:38:09 am, 12:47:38 PM.

Don't allow nonsense like 33:82:61 am

1?[0-9]:[0-5][0-9]:[0-5][0-9] +[AaPp][Mm]

Another way, using a group and repetition:

1?[0-9](:[0-5][0-9]){2} +[AaPp][Mm]

If the seconds are optional (8:33 am) then use:

1?[0-9](:[0-5][0-9]){1,2} +[AaPp][Mm]
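The time patterns can be checked with String-style full matches. One observation from testing: the hour part 1?[0-9] also accepts 0 and 13 through 19. The sketch below compares the slide's pattern with a stricter variant; the stricter hour rule (1[0-2]|0?[1-9]) is an assumption about the intended 1-12 range, and the class name TimePattern is made up for this example.

```java
import java.util.regex.Pattern;

// Hypothetical demo class comparing two time patterns.
class TimePattern {
    // Pattern from the slide (group and repetition form).
    static final Pattern SLIDE =
        Pattern.compile("1?[0-9](:[0-5][0-9]){2} +[AaPp][Mm]");

    // Assumed stricter variant: hours 1-12 only, seconds optional.
    static final Pattern STRICT =
        Pattern.compile("(1[0-2]|0?[1-9])(:[0-5][0-9]){1,2} +[AaPp][Mm]");

    static boolean isTime(String s) {
        return STRICT.matcher(s).matches();
    }

    public static void main(String[] args) {
        String[] samples = {"3:38:09 am", "12:47:38 PM", "8:33 am", "33:82:61 am", "19:00:00 am"};
        for (String s : samples) {
            System.out.println(s + "  slide=" + SLIDE.matcher(s).matches()
                    + "  strict=" + isTime(s));
        }
    }
}
```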

27

More Matching in Java

Examples using \w, \d

"hello".matches("\\w+") true

"you2!".matches("\\w+") false

"10900".matches("\\d{5}") true: match zipcode

String "split": String [ ] split( regular_expression )

String s = "I like java";

String [] w = s.split("\\s+");

returns:

w[0] = "I", w[1] = "like" w[2] = "java"

28

Pattern Extraction

Using a Matcher object, you can also find the position of a match, and extract the last matched string.

import java.util.regex.*;
...
Pattern pattern = Pattern.compile( "4754\\d\\d\\d\\d" );
System.out.println("enter input line to scan: ");
String text = console.readLine( );
Matcher matcher = pattern.matcher( text );
// matches() succeeds only if the whole line is an ID;
// find() locates each ID anywhere in the line.
while ( matcher.find() ) {              // find next match
    String id = matcher.group( );       // get the string
    System.out.println("found " + id + " at position " + matcher.start( ));
}

29

Groups and Pattern Extraction

( expression )

( ) defines a group that you want to re-use or extract.

\n

refers to n-th group matched using ( ).

What strings does this pattern match?

s.matches( "^(\\w+).*\\1$" )
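One way to explore the question is to run the pattern against a few strings. The sketch below is only an experiment harness; the class and method names are made up. In Java source the backslashes are doubled, and \\1 refers back to whatever the first group (\\w+) matched.

```java
// Hypothetical demo class for a back-reference to group 1.
class BackReference {
    // True if the string ends with the same word characters it starts with.
    static boolean startsAndEndsAlike(String s) {
        return s.matches("^(\\w+).*\\1$");
    }

    public static void main(String[] args) {
        String[] samples = {"hello world hello", "abab", "level", "hello world", "a"};
        for (String s : samples) {
            System.out.println("\"" + s + "\" -> " + startsAndEndsAlike(s));
        }
    }
}
```

Because the regex engine backtracks, group 1 may shrink to a single character, which is why a string like "level" also matches.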

30

NOT Regular Expressions

Don't confuse regular expressions (not part of EBNF) with BNF / EBNF notation.

Unfortunately, many sources use a hybrid of EBNF and regular expressions.

In regular expressions, ( ... ) is used for grouping, not a list of choices.

Wrong: bogus notation (looks more like EBNF):

identifier ::= ([A-Z][a-z]_ )([A-Z][a-z][0-9]_)*

Here, ( ) means "any one of these characters"

(abc) means "a" or "b" or "c"

31

Why Use Regular Expressions?

Can be directly translated into source code for a tokenizer.

Shorter than [E]BNF.

Many applications and many languages use them.

How to Learn Regular Expressions

search the web -- many tutorials

Core Java, p. 698-702. Java regular expressions are not exactly the same as the syntax in C, Perl, or flex.

Java API for "Pattern" class.

Perl or Flex book - define regex for these languages

32

Practice

Write a lexical description (using a regular expression) for:
    base 10 constants (1234, -1234)
    octal constants (0377)
    hexadecimal constants (0x2FA84D, 0Xeeef)
    floating point constants, with optional exponent

Write a Java method to find and extract all the words in a string; a word is a sequence of letters delimited by a non-letter or the begin/end of a line in the string.

Write a Java method to remove /* ... */ comments from a string.

33

Types of lexemes

Common Lexemes (classes of tokens)

identifiers: x, println, _INIT, ArrayList

numeric constants: 0, 10000, 2.98E+6

assignment operators: =, +=, -=, *=, /=, %=

arithmetic operators: *, /, +, -, %

boolean operators: &&, ||, ^, !

separators: [ ] ; : . , ( )

string literals: "hello there"

Defining many lexemes makes the syntactic grammar more precise.

Reserved words: may be defined as a class, or simply treated as identifiers at the lexical level.

34

White space and comments

“Internal” tokens of the scanner that are matched and discarded.

Typical white space: newlines, tabs, spaces

Comments:
    /* … */, // … \n    (C, C++, C#, Java)
    # … \n              (Perl, Unix Shells)
    (* … *)             (Pascal, ML)
    ; … \n              (Scheme)

Comments are generally not nested.

Comments & white space are ignored except that they serve as separators of tokens.

35

FORTRAN is an exception

No reserved words:
    REAL IF, THEN
    IF (THEN .GT. 0) IF = THEN

Compiler ignores spaces (spaces removed before tokenizing):
    SUM=0.0
    S U M = 0 . 0       (same as SUM=0.0)
    DO 99 I = 1,10      (loop: for i := 1 to 10 )
    DO 99 I = 1.10      (assignment: DO99I = 1.10 )

This means that parser must "look ahead" to identify syntax.

Lesson: don't remove white space before tokenizing.

36

Reserved words versus key words

Pascal: uses key words such as "integer", "real".

    var
      n: integer;         ("integer" has special meaning in this context)
      integer: real;      (no special meaning here, you can redefine it)
    begin
      integer := 0.5;

C: uses reserved words, such as "int", "float", "return".

Reserved words may not be redefined in a program.

    int n;
    float int;      Illegal! "int" is reserved.

Reserved words are easier than key words for scanner to recognize, and easier for people to read.

37

Predefined identifiers

Predefined identifiers have special meanings, but can be redefined (although they probably shouldn’t).

Examples of predefined identifiers in Java: String, Object, System

in Java, you can define your own String or Object class

Predefined Identifiers are not Reserved Words

Reserved words cannot be used as the name of anything (i.e., as an identifier) except itself.

38

Java "keywords" (reserved words)

abstract continue for new switch

assert default if package synchronized

boolean do goto private this

break double implements protected throw

byte else import public throws

case enum instanceof return transient

catch extends int short try

char final interface static void

class finally long strictfp volatile

const float native super while

The Java Language Specification calls these "key words".

39

Java reserved words (cont.)

The words const and goto are reserved, even though they are not used in the Java language.

Why (do you think) Java reserves "goto" and "const" ?

40


foreach : many languages have a "foreach" statement. In C#:

double [ ] data = new double[100]; ...

foreach( double x in data ) { sum += x; }

Java 5.0 defines a new syntax of "for" to do this:

for(double x : data ) sum += x;

Q: Why did Java use "for" instead of defining a "foreach" ?

What is the disadvantage of defining "foreach(var in collection)"?

41


true and false aren't listed as "keywords" in the language spec. The spec calls them boolean literals (sect 3.10.3). Similarly, null is the null literal (sect 3.10.7).

In actuality they are reserved words!These examples prove it (compiler gives an error msg):

/* Encapsulate a constant :-) */
public static true( ) { return true; }
public static false( ) { return false; }

/* If true and false are mere constants, we should be allowed to redefine the names locally. */
public void illogical( ) {
    boolean false = (1==1);   // false = true
    boolean true = !false;    // true = false
}

42

Categories of Grammar Rules

Declarations or definitions.

AttributeDeclaration ::= [ final ] [ static ] [ access ] datatype [ = expression ] { , datatype [ = expression ] } ;

access ::= 'public' | 'protected' | 'private'

Statements. assignment, if, for, while, do_while

Expressions,

such as the examples in these slides.

Structures such as statement blocks, methods, and entire classes.

StatementBlock ::= '{' { Statement; } '}'

43

Parsing Algorithms (1)

Broadly divided into LL and LR.

LL algorithms match input directly to left-side symbols, then choose a right-side production that matches the tokens. This is top-down parsing

LR algorithms try to match tokens to the right-side productions, then replace groups of tokens with the left-side nonterminal. They continue until the entire input has been "reduced" to the start symbol

LALR (look-ahead LR) are a special case of LR; they require a few restrictions to the LR case

Reference: Sebesta, section 4.3 - 4.5.

44

Parsing Algorithms (2)

Look ahead: algorithms must look at the next token(s) to decide between alternate productions for the current tokens.

LALR(1) means LALR with 1 token look-ahead.

LL(1) means LL with 1 token look-ahead.

LL algorithms are simpler and easier to visualize.

LR algorithms are more powerful: they can parse some grammars that LL cannot, such as left-recursive grammars.

yacc, bison, and CUP generate LALR(1) parsers

Recursive-descent is a useful LL algorithm that "every computer professional should know" [Louden].

45

Top-down Parsing Example

For the input: z = (2*x + 5)*y - 7;

tokens: ID = ( NUMBER * ID + NUMBER ) * ID - NUMBER ;

Grammar rules (as before):

assignment => ID = expression ;
expression => expression + term
            | expression - term
            | term
term       => term * factor
            | term / factor
            | factor
factor     => ( expression )
            | ID
            | NUMBER

46

Top-down Parsing Example (2)

The top-down parser tries to match input to left sides. (In the original slides, the part matched to the input so far is shown in green.)

input: ID = ( NUMBER * ID + NUMBER ) * ID - NUMBER ;

assignment
ID = expression ;
ID = expression - term ;
ID = term - term ;
ID = term * factor - term ;
ID = factor * factor - term ;
ID = ( expression ) * factor - term ;
ID = ( expression + term ) * factor - term ;
ID = ( term + term ) * factor - term ;
ID = ( term * factor + term ) * factor - term ;
ID = ( factor * ID + factor ) * factor - term ;
ID = ( NUMBER * ID + NUMBER ) * factor - term ;
ID = ( NUMBER * ID + NUMBER ) * ID - factor ;
ID = ( NUMBER * ID + NUMBER ) * ID - NUMBER ;

47

Top-down Parsing Example (3)

Problem in example: we had to look ahead many tokens in order to know which production to use.

This isn't necessary provided that we know the grammar is parsable using LL (top-down) methods.

There are conditions on the grammar that we can test to verify this. (see: The Parsing Problem)

Later we will study the recursive-descent algorithm which does top-down parsing with minimal look-ahead.
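As a preview, here is a minimal recursive-descent sketch for the expression grammar. It is an illustration under stated assumptions, not the algorithm as the course presents it: the left-recursive rules are rewritten as EBNF repetition (expression ::= term { (+|-) term }), since a recursive-descent method cannot call itself as its first step; factor accepts only unsigned integer NUMBERs and parentheses (no ID); and the parser evaluates the expression instead of building a parse tree. The class name RecursiveDescent is made up.

```java
// Hypothetical recursive-descent evaluator: one method per grammar rule,
// with a single character of look-ahead (peek).
class RecursiveDescent {
    private final String src;
    private int pos = 0;

    RecursiveDescent(String src) { this.src = src; }

    static double evaluate(String text) {
        RecursiveDescent p = new RecursiveDescent(text);
        double value = p.expression();
        if (p.peek() != '\0') {
            throw new IllegalArgumentException("unexpected input at position " + p.pos);
        }
        return value;
    }

    // Skip white space; return the next character without consuming it ('\0' at end).
    private char peek() {
        while (pos < src.length() && Character.isWhitespace(src.charAt(pos))) pos++;
        return pos < src.length() ? src.charAt(pos) : '\0';
    }

    // expression ::= term { ( '+' | '-' ) term }
    private double expression() {
        double value = term();
        while (peek() == '+' || peek() == '-') {
            char op = src.charAt(pos++);
            double right = term();
            value = (op == '+') ? value + right : value - right;
        }
        return value;
    }

    // term ::= factor { ( '*' | '/' ) factor }
    private double term() {
        double value = factor();
        while (peek() == '*' || peek() == '/') {
            char op = src.charAt(pos++);
            double right = factor();
            value = (op == '*') ? value * right : value / right;
        }
        return value;
    }

    // factor ::= '(' expression ')' | NUMBER
    private double factor() {
        char c = peek();
        if (c == '(') {
            pos++;                                  // consume '('
            double value = expression();
            if (peek() != ')') {
                throw new IllegalArgumentException("expected ) at position " + pos);
            }
            pos++;                                  // consume ')'
            return value;
        }
        if (!Character.isDigit(c)) {
            throw new IllegalArgumentException("expected number at position " + pos);
        }
        int start = pos;
        while (pos < src.length() && Character.isDigit(src.charAt(pos))) pos++;
        return Double.parseDouble(src.substring(start, pos));
    }

    public static void main(String[] args) {
        System.out.println(evaluate("(2*3 + 5)*4 - 7"));
    }
}
```

Each decision is made by looking at only the next character, which is the "minimal look-ahead" property mentioned above; the loops keep + and - left-associative, as the left-recursive rules require.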

48

The Parsing Problem

49

The Parsing Problem

Top-down parsers must decide which production to use based on the current symbol, and perhaps "peeking" at the next symbol (or two...).

Predictive parser: a parser that bases its actions on the next available token (called single symbol look-ahead).

Two conditions are necessary: [see Louden, p. 108-110]

The first condition is the ability to choose between multiple alternatives, such as: A → α1 | α2 | ... | αn

define First(α) = set of all tokens that can be the first token of any string that a cascade of productions derives from symbol α

then a predictive parser can be used for rule A if:

First(α1) ∩ First(α2) ∩ ... ∩ First(αn) is empty (more precisely, the sets must be pairwise disjoint).

50

The Parsing Problem (cont.)

The second condition is the ability of the parser to detect the presence of an optional element, such as A → [ α ]. Can the parser detect for certain whether α is present?

Example: list → expr [ list ]. How do we know that list isn't part of expr?

define Follow(α) = set of all tokens that can follow the non-terminal α in some production. Use a special symbol ($) to represent the end of input if α can be the end of input.

Example (for the expression / term / factor grammar): Follow( factor ) = Follow( term ) = { +, -, *, /, ), $ }

then a predictive parser can detect the presence of the optional symbol α if First(α) ∩ Follow(α) is empty.

51

Review and Thought Questions

52

Lexics vs. Syntax vs. Semantics

Division between lexical and syntactic structure is not fixed:
    a number can be a token or defined by a grammar rule.
    Implementation can often decide:
        scanners are faster
        parsers are more flexible
        error checking of number format as regex is simpler

Division between syntax and semantics is not fixed:
    we could define separate rules for IntegerNumber and FloatingPtNumber, IntegerTerm, FloatingPtTerm, ... in order to specify which mixed-mode operations are allowed.
    or specify as part of semantics

53

Numbers: Scan or Parse?

We can construct numbers from digits using the scanner or parser. Which is easier / better ?

Scanner: Define numbers as tokens:

number : [-]\d+

Parser: grammar rules define numbers (digits are tokens):

number => '-' unsignednumber | unsignednumber

unsignednumber => unsignednumber digit | digit

digit => 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9

54

Is Java 'Class' grammar context-free?

A class may have static and instance attributes.

An inner class or local class has the same syntax as a top-level class, but:

may not contain static members (except static constants)

inner class may access outer class using OuterClass.this

local class cannot be "public"

Does this mean the syntax for a class depends on context?

55

Alternative operator notation

Some languages use prefix notation: operator comes first

expr => + expr expr | * expr expr | NUMBER

Examples:

* + 2 3 4 means (2 + 3) * 4

+ 2 * 3 4 means 2 + (3 * 4)

Using prefix notation, we don't have to worry about precedence of different operators in BNF rules !
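The prefix grammar maps almost directly onto a recursive method, with no precedence handling at all. The following is a small sketch under a few assumptions: tokens are separated by white space, NUMBER is an integer, and the class name PrefixEval is made up for this example.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical evaluator for: expr => + expr expr | * expr expr | NUMBER
class PrefixEval {
    static int evaluate(String text) {
        Iterator<String> tokens = Arrays.asList(text.trim().split("\\s+")).iterator();
        int value = expr(tokens);
        if (tokens.hasNext()) {
            throw new IllegalArgumentException("extra input after expression");
        }
        return value;
    }

    // The first token alone decides which production applies.
    private static int expr(Iterator<String> tokens) {
        if (!tokens.hasNext()) {
            throw new IllegalArgumentException("unexpected end of input");
        }
        String t = tokens.next();
        switch (t) {
            case "+": return expr(tokens) + expr(tokens);
            case "*": return expr(tokens) * expr(tokens);
            default:  return Integer.parseInt(t);   // NUMBER
        }
    }

    public static void main(String[] args) {
        System.out.println(evaluate("* + 2 3 4"));   // (2 + 3) * 4
        System.out.println(evaluate("+ 2 * 3 4"));   // 2 + (3 * 4)
    }
}
```

Java evaluates the left operand before the right one, so the two recursive calls consume the operands in input order.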
