TRANSCRIPT
1
Syntax
The Grammar of a Language
2
Topics to Know
Difference between syntax and semantics.
Four categories of languages.
Regular expressions
  used for pattern matching and extracting info
  regular expressions are an essential part of many programming languages ... memorize them!
BNF and EBNF
  used to describe grammar of computer languages
  can be used to automatically generate a parser for a language
3
Syntax and Semantics
The syntax of a language defines the valid symbols and grammar.
Syntax defines the structure of a program, i.e., the form that each program unit and each statement must use.
The semantics defines the meaning of the grammar elements.
Lexical structure is the form of lowest level syntactic units (words or tokens) of a grammar.
4
Syntax and Semantics Compared
Syntax: in Java, an assignment statement is:
identifier = expression { operator expression } ;
Semantics: an assignment statement must use compatible types, e.g.
int n1, n2;
n1 = 20*1024; // OK, int_var = int_expression
n2 = 3.50; // illegal, incompatible types
Lexical elements (tokens):
"n2" "=" "3.50" ";"
5
Syntax and Semantics Compared
Syntax: the form of a while statement is:
while ( boolean_expression ) statement ;
Semantics:
when a thread of execution encounters a while statement the boolean expression is tested. If the expression evaluates to true then the statement is executed and the process is repeated. If [not when] the expression evaluates to false, execution continues to the next statement. ...
6
How are they used?
Parts of a Compiler / Interpreter:
Program source code
  -> Tokenizer (Lexical Analysis) -> token stream
  -> Parser (Syntax Analysis) -> parse tree
  -> Semantic Analysis -> intermediate code
  -> Optimization and Code Generation -> object code
7
Scanning and Parsing
source file -> input stream -> Tokenizer -> tokens -> Parser -> parse tree
input stream: sum = x1 + x2;
tokens: "sum" "=" "x1" "+" "x2" ";"
parse tree: assignment: sum = ( + x1 x2 )
8
Scanners
Recognize regular expressions
Implemented as finite automata (finite state machines)
Typically contain a loop that cycles through characters, building tokens and associated values by repeated operations
The scanner may be integrated as a function in the parser:
the parser calls the scanner to get the next token.
9
Parsers
Recognize patterns defined by grammar rules
Implemented as pushdown automata
Convert a stream of tokens (supplied by the scanner) into a parse tree containing symbols defined in the grammar.
Symbols are things like "assignment", "expression"
Parsing is more difficult than scanning.
10
Formal Languages
Famed linguist Noam Chomsky introduced a formal classification of (human) languages. In terms of computing theory, his categories are:
Hierarchy   Grammars            Languages                Minimal Automaton
Type 0      unrestricted        Recursively enumerable   Turing machine
                                Recursive                Decider
Type 1      context-sensitive   Context-sensitive        Linear-bounded
Type 2      context-free        Context-free             Pushdown
Type 3      regular             Regular                  Deterministic finite
Each class in this hierarchy is a subset of the class above it.
A context-free grammar is a grammar where the syntax of each constituent is independent of the symbols that come before and after it.
11
Applying Formal Languages (1)
Tokenizer or Scanner (Lexical Analysis):
The lexemes (tokens) of a computer language can be described by a "regular grammar" (Type 3).
Therefore, we can use the simplest grammar processor to make a tokenizer.
Rules for tokens are defined using regular expressions.
Examples:
integer ::= "[+-]?[0-9]+" (actually, this is too simple)
integer ::= "[+-]?(0[0-7]*|[1-9][0-9]*)" (better)
identifier ::= "[A-Za-z_][A-Za-z_0-9]*"
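To see why the first integer rule is "too simple", the token rules above can be tried directly in Java. This is only a sketch: the class and method names (TokenPatterns, isSimpleInteger, ...) are made up for illustration, and the patterns are copied from the rules above.

```java
import java.util.regex.Pattern;

public class TokenPatterns {
    // Patterns copied from the token rules on this slide.
    static final Pattern SIMPLE_INTEGER = Pattern.compile("[+-]?[0-9]+");
    static final Pattern BETTER_INTEGER = Pattern.compile("[+-]?(0[0-7]*|[1-9][0-9]*)");
    static final Pattern IDENTIFIER = Pattern.compile("[A-Za-z_][A-Za-z_0-9]*");

    // matches() succeeds only if the whole string fits the pattern.
    static boolean isSimpleInteger(String s) { return SIMPLE_INTEGER.matcher(s).matches(); }
    static boolean isBetterInteger(String s) { return BETTER_INTEGER.matcher(s).matches(); }
    static boolean isIdentifier(String s) { return IDENTIFIER.matcher(s).matches(); }

    public static void main(String[] args) {
        for (String s : new String[] {"-42", "0377", "089", "x1", "_INIT", "2x"}) {
            System.out.println(s + ": simple=" + isSimpleInteger(s)
                + " better=" + isBetterInteger(s) + " id=" + isIdentifier(s));
        }
    }
}
```

The input "089" is the interesting case: the simple rule accepts it, while the better rule rejects it because a constant with a leading 0 may contain only octal digits.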
12
Applying Formal Languages (2)
Syntax Analysis
The syntax of a computer language is a "Context-free Grammar" ...almost.
We can use a "type 2" grammar processor as a parser.
Rules are defined in Backus-Naur Form and Extended BNF.
Example:
expression ::= expression + term | expression - term | term
term ::= term * factor | term / factor | factor
factor ::= ( expression ) | NUMBER
13
Lexical Structure
Lexemes are the smallest lexical unit of a language, grouped according to syntactic usage.
Some types of lexemes in computer languages are:
identifiers: x, println, _INIT, ArrayList
numeric constants: 0, 10000, 2.98E+6
operators: =, +, -, ++, +=, *, /
separators: [ ] ; : . , ( )
string literals: "hello there"
A token is a string representing the value of a lexeme.
Lexemes are recognized by the first phase of a translator -- the scanner -- that deals directly with the input. The scanner separates the input into tokens.
Scanners are also called lexers.
14
Tokens
Tokens are the strings of syntactic units.
Example: what are the tokens in this statement?
result = (sum - average)/count;
Tokens:
result    identifier
=         assignment operator
(         expression delimiter
sum       identifier
-         arithmetic operator
average   identifier
)         expression delimiter
/         arithmetic operator
count     identifier
;         semi-colon (statement delimiter)
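The token list for this statement can be reproduced with a small regex-driven scanner. The sketch below is one possible way to do it, not the method used in the course: the class name SimpleTokenizer and the four token class names are assumptions, and the operator and delimiter sets cover only what this example needs.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class SimpleTokenizer {
    // One named group per token class; the class names are illustrative only.
    private static final Pattern TOKEN = Pattern.compile(
          "(?<identifier>[A-Za-z_][A-Za-z_0-9]*)"
        + "|(?<number>[0-9]+(\\.[0-9]+)?)"
        + "|(?<operator>[-+*/=])"
        + "|(?<delimiter>[();])"
        + "|(?<space>\\s+)");

    private static final String[] CLASSES = {"identifier", "number", "operator", "delimiter"};

    /** Returns tokens as "text:class" strings; white space is matched and discarded. */
    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            m.region(pos, input.length());   // scan from the current position
            if (!m.lookingAt()) {
                throw new IllegalArgumentException("unexpected character at " + pos);
            }
            for (String c : CLASSES) {
                if (m.group(c) != null) tokens.add(m.group() + ":" + c);
            }
            pos = m.end();                   // continue after the matched token
        }
        return tokens;
    }

    public static void main(String[] args) {
        for (String t : tokenize("result = (sum - average)/count;")) {
            System.out.println(t);
        }
    }
}
```

Each pass through the loop matches exactly one lexeme at the current position, which mirrors the "loop that cycles through characters, building tokens" description of a scanner.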
15
C tokens
Lexical structure of C, defined in The C Programming Language:
“There are six classes of tokens: identifiers, keywords, constants, string literals, operators, and other separators. Blanks, horizontal and vertical tabs, newlines, formfeeds, and comments as described below (collectively, "white space") are ignored except as they separate tokens. Some white space is required to separate otherwise adjacent identifiers, keywords, and constants. If the input stream has been separated into tokens up to a given character, the next token is the longest string of characters that could constitute a token.” [Kernighan and Ritchie, The C Programming Language, 2nd Ed., pp. 191-192.]
C uses the principle of longest match (substring).
16
Principle of Longest Match
A token should be the longest string that satisfies a rule for lexemes.
Example: x2+=1.0
tokens: "x2" "+=" "1.0"
NOT: "x" "2" "+" "=" "1" "." "0"
What are the tokens in these inputs?
x = y+1;
x = y+=1; // tokenizer cannot rely on optional spaces
x = y == 1;
if ( x++ = 1 ) getFirstValue( ); // probably an error
The tokenizer should not use context or look ahead more than 1 character. (FORTRAN is an exception)
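The principle of longest match can be sketched in a few lines of Java. Everything here is a simplifying assumption made for illustration: the class name LongestMatch, the small operator set, and the crude rule that any run of letters, digits, '_' or '.' forms a single identifier or number.

```java
import java.util.ArrayList;
import java.util.List;

public class LongestMatch {
    // Illustrative operator/separator set; a real language defines many more.
    private static final String[] OPERATORS = {"+", "++", "+=", "=", "==", "(", ")", ";"};

    // Crude rule: identifiers and numbers are runs of these characters.
    private static boolean isWordChar(char c) {
        return Character.isLetterOrDigit(c) || c == '_' || c == '.';
    }

    public static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            char ch = input.charAt(pos);
            if (Character.isWhitespace(ch)) { pos++; continue; }
            int end = pos;
            if (isWordChar(ch)) {
                // Extend the token as far as the rule allows.
                while (end < input.length() && isWordChar(input.charAt(end))) end++;
            } else {
                // Try every operator and keep the longest one that matches here.
                for (String op : OPERATORS) {
                    if (input.startsWith(op, pos) && pos + op.length() > end) {
                        end = pos + op.length();
                    }
                }
                if (end == pos) {
                    throw new IllegalArgumentException("unexpected character: " + ch);
                }
            }
            tokens.add(input.substring(pos, end));
            pos = end;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("x2+=1.0"));
        System.out.println(tokenize("x = y+=1;"));
        System.out.println(tokenize("x = y == 1;"));
    }
}
```

Because the longest candidate always wins, "+=" and "==" come out as single tokens whether or not the input contains spaces around them.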
17
Describing Lexical Structure
In C, an identifier (i) begins with a letter or _ (ii) followed by any number of letters, '_', or digits.
Compare these two examples: which way is simpler?
Rules for C identifiers using EBNF:
identifier ::= ( letter | _ ) { letter | _ | digit }
letter ::= 'A' | 'B' | 'C' | 'D' | . . . | 'Z' | 'a' | 'b' | 'c' | 'd' | . . . | 'z'
digit ::= '0' | '1' | '2' | '3' | '4' | '5' | '6' | '7' | '8' | '9'
Rule for C identifiers using Regular Expressions:
identifier ::= [A-Za-z_][A-Za-z_0-9]*
identifier ::= [A-Za-z_]\w*
18
Rules for Regular Expressions
Regular expressions:
x match an occurrence of x
[abcd] match any one of these characters
[A-Z] any one character from this range
[A-Za-z_] any letter or _
x* [a-z]* 0 or more occurrences of these
x+ [a-z]+ 1 or more occurrences of these
x? [a-z]? exactly 0 or 1 occurrence
x{5} match exactly 5 times, same as xxxxx
x{4,8} match between 4 and 8 times
. (period) match any one character
.* match anything!
19
Pattern Matching in Java
java.util.regex.Matcher
matches Strings using regular expressions (patterns)
can be used to extract the thing that is matched
java.util.regex.Pattern
defines objects for regular expressions
String.matches( regex )
tests whether the entire string matches a regular expression.
String.split( regex [, limit ] )
splits a string every place the regex is found.
20
Using String matches( )
Java's String matches( ) method and the Matcher class.
"hi".matches("[a-z]+") true
"you2".matches("[a-z]+") false
"9".matches("\\d") true, \d := [0-9]
"4@com".matches(".+@.+") anything @ anything
"away".matches("a.*y") true
"ay".matches("a.*y") true
"ay".matches("a.+y") false, at least 1
"123".matches("-?\\d+") true
"-123".matches("-?\\d+") true
"--12".matches("-?\\d+") false
21
Using String split( )
Split a string into pieces anyplace a white space is found
String s = "split me\tat white space ";
String [ ] word = s.split("\\s+");
word[0]="split"
word[1]="me"
word[2]="at"
word[3]="white"
word[4]="space"
22
Character classes and special chars
\n \t \r \f \e newline, tab, return, formfeed, escape
\x4E character with hexadecimal value 4E
\u1234 character with Unicode value 1234
[^abcd] any character NOT (^) in this set.
^ must be FIRST CHAR after "["
\d any digit, same meaning as [0-9]
\D any non-digit,same as [^0-9]
\s any white space, [ \t\n\x0B\f\r]
\S any non-whitespace
\w any word character, [a-zA-Z0-9_]
\W any non-word character, [^a-zA-Z0-9_]
( ... ) pattern group
23
Examples using Character Classes
Match a valid student ID for this class:
4[6-9]\d{6}
A C identifier: begins with a letter or underscore, followed by any number of letters, digits, or underscore
[A-Za-z_]\w*
Hello in Thai:
\u0E2A\u0E27\u0E31\u0E2A\u0E14\u0E35
24
Positional Matching
^ match beginning of a line
$ match the end of a line
\b match a word boundary
Match a student ID at the beginning of the string, as a word by itself:
Pattern p = Pattern.compile("^\\d{8}\\b.*");
p.matcher("47541234 Joe Hacker").matches(); // match
p.matcher("123456789 too long").matches(); // no match
p.matcher("My ID is 48541234").matches(); // no match
25
Regular Expression for the Time
Write a pattern to match a time string of the form "hh:mm:ss am" or "hh:mm:ss PM"; use a 12 hour clock (am,pm). "am", "pm" can also be written "AM", "PM".
Example: 3:38:09 am,
9:59:59 AM,
12:47:38 PM.
26
Regular Expression for the Time
Write a pattern to match a time string of the form "hh:mm:ss am" or "hh:mm:ss PM"; use a 12 hour clock (am,pm). "am", "pm" can be uppercase or lowercase.
Example: 3:38:09 am, 12:47:38 PM.
Don't allow nonsense like 33:82:61 am
1?[0-9]:[0-5][0-9]:[0-5][0-9] +[AaPp][Mm]
Another way, using a group and repetition:
1?[0-9](:[0-5][0-9]){2} +[AaPp][Mm]
If the seconds are optional (8:33 am) then use:
1?[0-9](:[0-5][0-9]){1,2} +[AaPp][Mm]
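Note that the hour part 1?[0-9] still accepts hours such as 0 and 13 to 19, which are not valid on a 12 hour clock. The sketch below compares the slide's pattern with a stricter hour rule. The class name TimePattern is made up, and allowing an optional leading zero ("03:38:09 am") is an assumption that the slide's own pattern does not make.

```java
import java.util.regex.Pattern;

public class TimePattern {
    // Pattern from the slide: the hour part accepts 0..19.
    static final Pattern SLIDE =
        Pattern.compile("1?[0-9](:[0-5][0-9]){2} +[AaPp][Mm]");

    // Stricter hour: 10..12, or 1..9 with an optional leading zero (an assumption).
    static final Pattern STRICT =
        Pattern.compile("(1[0-2]|0?[1-9])(:[0-5][0-9]){2} +[AaPp][Mm]");

    static boolean slideMatches(String s) { return SLIDE.matcher(s).matches(); }
    static boolean strictMatches(String s) { return STRICT.matcher(s).matches(); }

    public static void main(String[] args) {
        String[] samples = {"3:38:09 am", "12:47:38 PM", "19:05:00 pm", "0:15:00 am", "33:82:61 am"};
        for (String s : samples) {
            System.out.println(s + " slide=" + slideMatches(s) + " strict=" + strictMatches(s));
        }
    }
}
```

The two patterns agree on the valid examples and on "33:82:61 am"; they differ only on hours like 19 and 0.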
27
More Matching in Java
Examples using \w, \d
"hello".matches("\\w+") true
"you2!".matches("\\w+") false
"10900".matches("\\d{5}") true: match zipcode
String "split": String [ ] split( regular_expression )
String s = "I like java";
String [] w = s.split("\\s+");
returns:
w[0] = "I", w[1] = "like", w[2] = "java"
28
Pattern Extraction
Using a Matcher object, you can also find the position of a match, and extract the last matched string.
import java.util.regex.*;
...
Pattern pattern = Pattern.compile( "4754\\d\\d\\d\\d" );
System.out.println("enter input line to scan: ");
String text = console.readLine( );
Matcher matcher = pattern.matcher( text );
while ( matcher.find() ) { // find next match
    String id = matcher.group( ); // get the matched string
    System.out.println("found " + id + " at position " + matcher.start());
}
29
Groups and Pattern Extraction
( expression )
( ) defines a group that you want to re-use or extract.
\n
refers to n-th group matched using ( ).
What strings does this pattern match?
s.matches( "^(\\w+).*\\1$" )
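One way to explore the question is to try the pattern on a few strings. This is a small sketch; the class and method names are made up for illustration. The back-reference \1 must repeat exactly what group 1 matched, so the pattern accepts strings that end with the same word characters they begin with.

```java
public class BackReference {
    /** True if s begins with one or more word characters and ends with the same characters. */
    static boolean startsAndEndsAlike(String s) {
        // (\w+) captures a prefix; \1 requires that same text again at the end.
        return s.matches("^(\\w+).*\\1$");
    }

    public static void main(String[] args) {
        for (String s : new String[] {"hello world hello", "abcab", "hello world", "x"}) {
            System.out.println("\"" + s + "\" -> " + startsAndEndsAlike(s));
        }
    }
}
```

Note that "abcab" matches: the regex engine backtracks until group 1 is "ab", and the string ends with "ab". A single character such as "x" does not match, because the prefix and the back-reference cannot overlap.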
30
NOT Regular Expressions
Don't confuse regular expressions (not part of EBNF) with BNF / EBNF notation.
Unfortunately, many sources use a hybrid of EBNF and regular expressions.
In regular expressions, ( ... ) is used for grouping, not a list of choices.
Wrong: bogus notation (looks more like EBNF):
identifier ::= ([A-Z][a-z]_ )([A-Z][a-z][0-9]_)*
Here, ( ) means "any one of these characters"
(abc) means "a" or "b" or "c"
31
Why Use Regular Expressions?
Can be directly translated into source code for a tokenizer.
Shorter than [E]BNF.
Many applications and many languages use them.
How to Learn Regular Expressions
search the web -- many tutorials
Core Java, p. 698-702. Java regular expressions are not exactly the same as the syntax in C, Perl, or flex.
Java API for "Pattern" class.
Perl or Flex book - define regex for these languages
32
Practice
Write a lexical description (using a regular expression) for:
  base 10 constants (1234, -1234)
  octal constants (0377)
  hexadecimal constants (0x2FA84D, 0Xeeef)
  floating point constants, with optional exponent
Write a Java method to find and extract all the words in a string; a word is a sequence of letters delimited by a non-letter or the begin/end of a line in the string.
Write a Java method to remove /* ... */ comments from a string.
33
Types of lexemes
Common Lexemes (classes of tokens)
identifiers: x, println, _INIT, ArrayList
numeric constants: 0, 10000, 2.98E+6
assignment operators: =, +=, -=, *=, /=, %=
arithmetic operators: *, /, +, -, %
boolean operators: &&, ||, ^, !
separators: [ ] ; : . , ( )
string literals: "hello there"
Defining many lexemes makes the syntactic grammar more precise.
Reserved words: may be defined as a class, or simply treated as identifiers at the lexical level.
34
White space and comments
“Internal” tokens of the scanner that are matched and discarded.
Typical white space: newlines, tabs, spaces
Comments:
  /* … */, // … \n (C, C++, C#, Java)
  # … \n (Perl, Unix Shells)
  (* … *) (Pascal, ML)
  ; … \n (Scheme)
Comments generally not nested.
Comments & white space ignored except that they serve as separators of tokens.
35
FORTRAN is an exception
No reserved words:
  REAL IF, THEN
  IF (THEN .GT. 0) IF = THEN
Compiler ignores spaces (spaces removed before tokenizing):
  SUM=0.0
  S U M = 0 . 0 (same as SUM=0.0)
  DO 99 I = 1,10 (loop: for i := 1 to 10 )
  DO 99 I = 1.10 (assignment: DO99I = 1.10 )
This means that parser must "look ahead" to identify syntax.
Lesson: don't remove white space before tokenizing.
36
Reserved words versus key words
Pascal: uses key words such as "integer", "real".
  var
    n: integer;      ("integer" has special meaning in this context)
    integer: real;   (no special meaning here, you can redefine it)
  begin
    integer := 0.5;
C: uses reserved words, such as "int", "float", "return".
Reserved words may not be redefined in a program.
  int n;
  float int;         Illegal! "int" is reserved.
Reserved words are easier than key words for scanner to recognize, and easier for people to read.
37
Predefined identifiers
Predefined identifiers have special meanings, but can be redefined (although they probably shouldn’t).
Examples of predefined identifiers in Java: String, Object, System
in Java, you can define your own String or Object class
Predefined Identifiers are not Reserved Words
Reserved words cannot be used as the name of anything (i.e., as an identifier) except itself.
38
Java "keywords" (reserved words)
abstract continue for new switch
assert default if package synchronized
boolean do goto private this
break double implements protected throw
byte else import public throws
case enum instanceof return transient
catch extends int short try
char final interface static void
class finally long strictfp volatile
const float native super while
The Java Language Specification calls these "key words".
39
Java reserved words (cont.)
The words const and goto are reserved, even though they are not used in the Java language.
Why (do you think) Java reserves "goto" and "const" ?
40
Java reserved words (cont.)
foreach : many languages have a "foreach" statement. In C#:
double [ ] data = new double[100]; ...
foreach( double x in data ) { sum += x; }
Java 5.0 defines a new syntax of "for" to do this:
for(double x : data ) sum += x;
Q: Why did Java use "for" instead of defining a "foreach" ?
What is the disadvantage of defining "foreach(var in collection)"?
41
Java reserved words (cont.)
true and false aren't listed as "keywords" in the language spec. The spec calls them boolean literals (sect 3.10.3). Similarly, null is the null literal (sect 3.10.7).
In actuality they are reserved words! These examples prove it (compiler gives an error msg):
/* Encapsulate a constant :-) */
public static boolean true( ) { return true; }
public static boolean false( ) { return false; }
/* If true and false are mere constants, we should be allowed to redefine the names locally. */
public void illogical( ) {
    boolean false = (1==1); // false = true
    boolean true = !false; // true = false
}
42
Categories of Grammar Rules
Declarations or definitions.
AttributeDeclaration ::= [ final ] [ static ] [ access ] datatype identifier [ = expression ] { , identifier [ = expression ] } ;
access ::= 'public' | 'protected' | 'private'
Statements. assignment, if, for, while, do_while
Expressions,
such as the examples in these slides.
Structures such as statement blocks, methods, and entire classes.
StatementBlock ::= '{' { Statement; } '}'
43
Parsing Algorithms (1)
Broadly divided into LL and LR.
LL algorithms match input directly to left-side symbols, then choose a right-side production that matches the tokens. This is top-down parsing
LR algorithms try to match tokens to the right-side productions, then replace groups of tokens with the left-side nonterminal. They continue until the entire input has been "reduced" to the start symbol
LALR (look-ahead LR) are a special case of LR; they require a few restrictions to the LR case
Reference: Sebesta, section 4.3 - 4.5.
44
Parsing Algorithms (2)
Look ahead: algorithms must look at next token(s) to decide between alternate productions for current tokens
LALR(1) means LALR with 1 token look-ahead
LL(1) means LL with 1 token look-ahead
LL algorithms are simpler and easier to visualize.
LR algorithms are more powerful: they can parse some grammars that LL cannot, such as left-recursive grammars.
yacc, bison, and CUP generate LALR(1) parsers
Recursive-descent is a useful LL algorithm that "every computer professional should know" [Louden].
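As a rough illustration of recursive descent, here is a small sketch for the expression / term / factor grammar used in these slides. It rests on several assumptions: the left-recursive rules are rewritten as EBNF repetition (expression ::= term { (+|-) term }), because a top-down parser cannot use left recursion directly; the parser reads characters instead of calling a separate scanner; and it computes a value instead of building a parse tree. The class name RecursiveDescent is made up.

```java
public class RecursiveDescent {
    private final String input;
    private int pos = 0;

    private RecursiveDescent(String input) { this.input = input; }

    /** Parses and evaluates an arithmetic expression; throws on a syntax error. */
    public static double evaluate(String text) {
        RecursiveDescent p = new RecursiveDescent(text);
        double value = p.expression();
        if (p.peek() != '\0') {
            throw new IllegalArgumentException("unexpected '" + p.peek() + "' at " + p.pos);
        }
        return value;
    }

    // Skips blanks, then returns the next character without consuming it ('\0' at end).
    private char peek() {
        while (pos < input.length() && Character.isWhitespace(input.charAt(pos))) pos++;
        return pos < input.length() ? input.charAt(pos) : '\0';
    }

    // expression ::= term { ( '+' | '-' ) term }
    private double expression() {
        double value = term();
        while (peek() == '+' || peek() == '-') {
            char op = input.charAt(pos++);
            double right = term();
            value = (op == '+') ? value + right : value - right;
        }
        return value;
    }

    // term ::= factor { ( '*' | '/' ) factor }
    private double term() {
        double value = factor();
        while (peek() == '*' || peek() == '/') {
            char op = input.charAt(pos++);
            double right = factor();
            value = (op == '*') ? value * right : value / right;
        }
        return value;
    }

    // factor ::= '(' expression ')' | NUMBER
    private double factor() {
        if (peek() == '(') {
            pos++;                                   // consume '('
            double value = expression();
            if (peek() != ')') throw new IllegalArgumentException("expected ')' at " + pos);
            pos++;                                   // consume ')'
            return value;
        }
        int start = pos;
        while (pos < input.length()
                && (Character.isDigit(input.charAt(pos)) || input.charAt(pos) == '.')) pos++;
        if (start == pos) throw new IllegalArgumentException("expected a number at " + pos);
        return Double.parseDouble(input.substring(start, pos));
    }

    public static void main(String[] args) {
        System.out.println(evaluate("(2*3 + 5)*4 - 7"));
        System.out.println(evaluate("8 - 3 - 2"));
    }
}
```

There is one method per nonterminal, and each method decides what to do by looking at only the next character, which is single-symbol look-ahead. The while loops keep the operators left-associative, so 8 - 3 - 2 is read as (8 - 3) - 2.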
45
Top-down Parsing Example
For the input: z = (2*x + 5)*y - 7;
tokens: ID = ( NUMBER * ID + NUMBER ) * ID - NUMBER ;
Grammar rules (as before):
assignment => ID = expression ;
expression => expression + term
            | expression - term
            | term
term       => term * factor
            | term / factor
            | factor
factor     => ( expression )
            | ID
            | NUMBER
46
Top-down Parsing Example (2)
The top-down parser tries to match input to left sides. In the example, GREEN is the part matched to the input so far.
ID = ( NUMBER * ID + NUMBER ) * ID - NUMBER ;
assignment
ID = expression ;
ID = expression - term ;
ID = term - term ;
ID = term * factor - term ;
ID = factor * factor - term ;
ID = ( expression ) * factor - term ;
ID = ( expression + term ) * factor - term ;
ID = ( term + term ) * factor - term ;
ID = ( term * factor + term ) * factor - term ;
ID = ( factor * ID + factor ) * factor - term ;
ID = ( NUMBER * ID + NUMBER ) * factor - term ;
ID = ( NUMBER * ID + NUMBER ) * ID - factor ;
ID = ( NUMBER * ID + NUMBER ) * ID - NUMBER ;
47
Top-down Parsing Example (3)
Problem in example: we had to look ahead many tokens in order to know which production to use.
This isn't necessary provided that we know the grammar is parsable using LL (top-down) methods.
There are conditions on the grammar that we can test to verify this. (see: The Parsing Problem)
Later we will study the recursive-descent algorithm which does top-down parsing with minimal look-ahead.
48
The Parsing Problem
49
The Parsing Problem
Top-down parsers must decide which production to use based on the current symbol, and perhaps "peeking" at the next symbol (or two...).
Predictive parser: a parser that bases its actions on the next available token (called single symbol look-ahead).
Two conditions are necessary: [see Louden, p. 108-110]
The first condition is the ability to choose between multiple alternatives, such as: A → α1 | α2 | ... | αn
define First(α) = set of all tokens that can be the first token of any string derived from α
then a predictive parser can be used for rule A if:
First(αi) ∩ First(αj) is empty for every pair i ≠ j.
50
The Parsing Problem (cont.)
The second condition is the ability of the parser to detect the presence of an optional element, such as A → [ α ]. Can the parser detect for certain whether α is present?
Example: list → expr [list]. How do we know that list isn't part of expr?
define Follow(α) = set of all tokens that can follow the non-terminal α in some production. Use a special symbol ($) to represent the end of input if α can be the end of input.
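Example: for the expression / term / factor grammar shown earlier, Follow( factor ) = Follow( term ) = { +, -, *, /, ), $ }, since a term can be followed by * or / (as in term * factor) as well as by anything that can follow an expression.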
then a predictive parser can detect the presence of optional symbol α if First(α) ∩ Follow(α) is empty.
51
Review and Thought Questions
52
Lexics vs. Syntax vs. Semantics
Division between lexical and syntactic structure is not fixed:
  a number can be a token or defined by a grammar rule.
Implementation can often decide:
  scanners are faster
  parsers are more flexible
  error checking of number format as regex is simpler
Division between syntax and semantics is not fixed:
  we could define separate rules for IntegerNumber and FloatingPtNumber, IntegerTerm, FloatingPtTerm, ... in order to specify which mixed-mode operations are allowed.
  or specify as part of semantics
53
Numbers: Scan or Parse?
We can construct numbers from digits using the scanner or parser. Which is easier / better ?
Scanner: Define numbers as tokens:
number : -?\d+
Parser: grammar rules define numbers (digits are tokens):
number => '-' unsignednumber | unsignednumber
unsignednumber => unsignednumber digit | digit
digit => 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
54
Is Java 'Class' grammar context-free?
A class may have static and instance attributes.
An inner class or local class has the same syntax as a top-level class, but:
may not contain static members (except static constants)
inner class may access outer class using OuterClass.this
local class cannot be "public"
Does this mean the syntax for a class depends on context?
55
Alternative operator notation
Some languages use prefix notation: operator comes first
expr => + expr expr | * expr expr | NUMBER
Examples:
* + 2 3 4 means (2 + 3) * 4
+ 2 * 3 4 means 2 + (3 * 4)
Using prefix notation, we don't have to worry about precedence of different operators in BNF rules !
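A short sketch shows how little machinery the prefix grammar needs: one recursive method, no precedence levels. The class name PrefixEvaluator and the choice of space-separated tokens are assumptions made for illustration.

```java
import java.util.Arrays;
import java.util.Iterator;

public class PrefixEvaluator {
    /** Evaluates a space-separated prefix expression such as "* + 2 3 4". */
    public static double evaluate(String text) {
        Iterator<String> tokens = Arrays.asList(text.trim().split("\\s+")).iterator();
        double value = expr(tokens);
        if (tokens.hasNext()) {
            throw new IllegalArgumentException("extra input after expression");
        }
        return value;
    }

    // expr => + expr expr | * expr expr | NUMBER
    private static double expr(Iterator<String> tokens) {
        if (!tokens.hasNext()) {
            throw new IllegalArgumentException("unexpected end of input");
        }
        String token = tokens.next();
        if (token.equals("+")) {
            double left = expr(tokens);    // first operand
            double right = expr(tokens);   // second operand
            return left + right;
        }
        if (token.equals("*")) {
            double left = expr(tokens);
            double right = expr(tokens);
            return left * right;
        }
        return Double.parseDouble(token);  // NUMBER
    }

    public static void main(String[] args) {
        System.out.println(evaluate("* + 2 3 4"));
        System.out.println(evaluate("+ 2 * 3 4"));
    }
}
```

The first token always tells the parser which production to use, so this grammar is trivially suited to single-token look-ahead.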