Source: eliza.newhaven.edu/lang/attach/l5-parsing.pdf
Structure of Programming Languages – Lecture 5
CSCI 6636 – 4536
February 25, 2020
CSCI 6636 – 4536 Lecture 5. . . 1/36 February 25, 2020 1 / 36
Outline
1 Syntax and its Specification
2 Context-free Languages
    History
    Extended BNF
    Syntax Diagrams
3 The Definition of Pascal
4 Parsing
    LL Parsers
    LR Parsers
5 Homework
Syntax and its Specification
Part 1
1. Syntax and its Specification
Context-Free Languages
Extended Backus-Naur Form
Syntax Diagrams
Context-free Languages History
Context-Free Languages
Formally, almost all programming languages belong to the category called “context-free languages”. That is, the syntax of the language (excluding the type matching rules) can be described by a context-free grammar.
The set of all context-free languages is identical to the set of languages accepted by a finite-state machine that uses a stack for temporary storage.
We call such a machine a pushdown automaton.
A context-free grammar provides a simple and precise mechanism for describing the way phrases in a language are built from smaller blocks.
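To make the "finite-state machine plus a stack" idea concrete, here is a minimal sketch of a pushdown-style recognizer for matched brackets. The function name `balanced` and the choice of bracket pairs are assumptions made for illustration; they do not come from the lecture.

```python
# Map each opening bracket to the closing bracket it must be matched with.
PAIRS = {"(": ")", "[": "]"}
CLOSERS = set(PAIRS.values())


def balanced(text):
    """Return True if every bracket in text is properly matched and nested."""
    stack = []  # the temporary storage of the pushdown automaton
    for ch in text:
        if ch in PAIRS:
            stack.append(PAIRS[ch])  # remember which closer we now expect
        elif ch in CLOSERS:
            if not stack or stack.pop() != ch:
                return False  # a closer with no matching opener
    return not stack  # any leftover opener is unmatched


print(balanced("a[(i+1)*2]"))
print(balanced("(]"))
```

Without the stack, a machine with a fixed number of states cannot keep track of an unbounded nesting depth, which is why this check is beyond regular languages.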
Regular vs. Context-free
Regular grammars support sequences of elements, choice among a set of elements, bounded and unbounded repetition, and using a subroutine (a separate expression that defines a set of elements).
Context-free languages support all of the above, plus recursion. Recursive rules provide the ability to describe matched pairs of elements, such as parentheses. This power is necessary to describe many parts of a programming language, including:
    Program blocks with nested blocks
    Arithmetic expressions with nested subexpressions
    Array subscripts; the ‘[’ and ‘]’ must match.
    Begin-comment and matching end-comment markers
There is a part of programming languages that context-free grammars cannot describe: the semantics of types.
A Context-free Grammar G is
A finite set of nonterminal symbols, V (for vocabulary), each one representing a different type of syntactic category in the language.
A finite set of keywords and punctuation, Σ (for symbol). These are called terminal symbols.
A finite set, R, of rules or productions of the grammar.
    There must be at least one rule for every nonterminal symbol.
The starting symbol, S, is used to represent the whole sentence or program. It must be an element of V.
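The four parts of the definition map directly onto a small data structure. The sketch below is one possible encoding, not the lecture's own; the class name `Grammar`, its field names, and the `problems` method are all hypothetical. It checks the two constraints stated above: every nonterminal has at least one rule, and the starting symbol is an element of V.

```python
from dataclasses import dataclass


@dataclass
class Grammar:
    nonterminals: set  # V
    terminals: set     # Σ
    rules: dict        # R: nonterminal -> list of alternatives (tuples of symbols)
    start: str         # S

    def problems(self):
        """Return a list of violations of the definition (empty if none)."""
        issues = []
        if self.start not in self.nonterminals:
            issues.append(f"start symbol {self.start!r} is not in V")
        for nt in sorted(self.nonterminals):
            if not self.rules.get(nt):
                issues.append(f"no rule for nonterminal {nt!r}")
        known = self.nonterminals | self.terminals
        for lhs, alternatives in self.rules.items():
            for alt in alternatives:
                for sym in alt:
                    if sym not in known:
                        issues.append(f"unknown symbol {sym!r} in a rule for {lhs!r}")
        return issues


# A small illustrative grammar with V = {S, A} and Σ = {x, (, )}.
example = Grammar(
    nonterminals={"S", "A"},
    terminals={"x", "(", ")"},
    rules={"S": [("A",)], "A": [("A", "S", "A"), ("(", "S", ")"), ("x",)]},
    start="S",
)
print(example.problems())  # an empty list means the four parts are consistent
```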
History: Describing Programming Language Syntax.
Context-free grammars were developed by logicians in the 1950s. The notation they used was mathematical in nature and not friendly to computer keyboards.
The first application was to analysis of natural language.
Soon afterward, they were applied to the definition of programming languages.
Context-free grammars were then crucial in the development of languages and translation tools.
So a new notation called Backus-Naur Form (BNF) was developed, better adapted to the character set supported by computers.
Later, it was extended to make it easier to use. The extended version is called EBNF, and one version of EBNF is the notation commonly used today.
Context-free Languages Extended BNF
The syntax for EBNF itself
EBNF is a notation for writing context-free grammars. We say it is a metalanguage, that is, a language for describing languages.
Nonterminal symbols will be written in non-bold type and/or enclosed in < . . . >.
Terminal symbols will be written in boldface and/or enclosed in ‘single quotes’.
Production rules. The nonterminal being defined is written at the left, followed by an “=” sign (which we will pronounce as “becomes”). After this is a set of options, which define how the nonterminal can be expanded. The rule extends up to but does not include the “;” at the end.
When a nonterminal is expanded it is replaced by one of the options from its definition.
Blank spaces between the “=” and the “;” are ignored.
Syntax for EBNF Production Rules
Alternatives are separated by vertical bars. This indicates that an ‘s’ may be replaced by an ‘a’ or a ‘bc’:
s ::= a | bc .
Parentheses may be used to indicate grouping. For example, this indicates that an ‘s’ may be replaced by an ‘ad’ or a ‘bcd’:
s ::= ( a | bc ) d .
Something enclosed in square brackets is optional. For example, this rule says that an ‘s’ may be replaced by an ‘ad’ or simply by a ‘d’:
s ::= [a] d .
Zero or more repetitions of a unit are indicated by enclosing the unit in curly braces. This rule says that an ‘s’ may be replaced by any number of ‘a’s (possibly none), followed by a single ‘d’ and one or more ‘b’s, for example ‘db’, ‘adb’, or ‘aadbb’:
s ::= {a} d b {b} .
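None of the four example rules is recursive, so each one can be translated by hand into an ordinary regular expression. The sketch below uses Python's `re` module to test candidate strings; the dictionary `RULES` and the helper `derives` are names invented for this illustration.

```python
import re

# Each EBNF rule from the examples, paired with a hand-translated regex:
# "|" stays "|", ( ) stays grouping, [x] becomes x?, {x} becomes x*.
RULES = {
    "s ::= a | bc .": r"a|bc",
    "s ::= ( a | bc ) d .": r"(a|bc)d",
    "s ::= [a] d .": r"a?d",
    "s ::= {a} d b {b} .": r"a*db+",
}


def derives(rule, text):
    """Return True if text is one of the strings the rule can produce."""
    return re.fullmatch(RULES[rule], text) is not None


for candidate in ("db", "aadbb", "d"):
    print(candidate, derives("s ::= {a} d b {b} .", candidate))
```

This translation only works while the rules stay non-recursive; once a rule refers to itself, as in the nested examples later in the lecture, a regular expression is no longer enough.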
Example: A context-free grammar and EBNF notation.
This grammar defines the part of Pascal arithmetic expressions that applies to primitive types.
V is {addOp, multiplyOp, relationalOp, expression, simpleExpr, sign, term, factor, variableAccess, unsignedConstant}
The starting symbol is expression.
Σ is not, =, and all 15 operators
R is this set of rules:
1. relationalOp = < | <= | > | >= | = | <> ;
2. addOp = + | - | or | xor ;
3. multiplyOp = * | / | div | mod | and ;
4. expression = simpleExpr { relationalOp simpleExpr } ;
5. simpleExpr = [ sign ] term { addOp term } ;
6. term = factor { multiplyOp factor } ;
7. factor = variableAccess | unsignedConstant | ‘(’ expression ‘)’ | not factor ;
Applying the grammar for expressions.
Consider the expression 3 ∗ (x + 2) < limit
a. x is a variableAccess, so is limit.
b. A variableAccess is a factor and a factor is a term.
c. x+2 is a simpleExpr: term + term with no sign.
d. A simpleExpr is an expression, and with the parentheses, the whole unit is a factor.
e. 3*(x+2) is a term: factor multiplyOp factor .
f. So it is also a simpleExpr .
g. 3 ∗ (x + 2) < limit is an expression: simpleExpr relationalOp simpleExpr.
Rule 7 contains a recursive reference to rule 4. Steps d and g show nesting, the result of applying that recursive definition.
Example: An EBNF grammar for AS.
This grammar defines a nonsense language called AS.
V is {S, A} and the starting symbol is S.
Σ is x ( )
R: On the left is the boldface presentation of the rules; on the right is the machine-compatible version that uses angle brackets around non-terminals.
1. S ::= A .          1. S = <A> ;
2. A ::= ( S ) .      2. A = ( <S> ) ;
3. A ::= ASA .        3. A = <A><S><A> ;
4. A ::= x .          4. A = x ;
Rules 2, 3, and 4 can be consolidated to: A ::= ASA | ( S ) | x .
Describing Programming Language Syntax
This grammar illustrates how matched and nested symbols are generated.
Start a derivation by writing down the starting symbol.
Apply rules to nonterminal symbols, in any order, to reach your goal.
Stop when all the nonterminals are gone.
Any rule that introduces a left-paren must also introduce a matching right-paren.
The grammar is recursive so that parenthesized units can be produced inside other pairs of parentheses.
Example: What strings are in the AS language?
Following are a few examples of AS derivations.
S → A → x .
S → A → (S) → (A) → (x) .
S → A → ASA → xSx → xAx → xxx .
S → A → ASA → (S)Sx → (A)Ax → (x)xx .
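The derivation procedure (start with S, expand nonterminals, stop when none remain) can be automated. The sketch below enumerates every AS sentence up to a chosen length by always expanding the leftmost nonterminal. The function name `generate` and the length bound are assumptions for this illustration. The search terminates because no AS rule ever shortens a sentential form.

```python
from collections import deque

# The AS grammar: S ::= A .   A ::= ASA | ( S ) | x .
RULES = {"S": ["A"], "A": ["ASA", "(S)", "x"]}


def generate(max_len):
    """Return all AS sentences of at most max_len symbols, shortest first."""
    seen = {"S"}
    queue = deque(["S"])  # sentential forms still to be expanded
    sentences = set()
    while queue:
        form = queue.popleft()
        # Find the leftmost nonterminal, if any.
        idx = next((i for i, ch in enumerate(form) if ch in RULES), None)
        if idx is None:
            sentences.add(form)  # all nonterminals are gone: a sentence
            continue
        for rhs in RULES[form[idx]]:
            new_form = form[:idx] + rhs + form[idx + 1:]
            if len(new_form) <= max_len and new_form not in seen:
                seen.add(new_form)
                queue.append(new_form)
    return sorted(sentences, key=lambda s: (len(s), s))


print(generate(3))
```

Restricting the search to leftmost derivations loses no sentences; it only avoids revisiting the same sentence through differently ordered rule applications.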
Example: An EBNF Grammar for Nonsense.
This grammar includes a loop and an optional element.
The starting symbol is S .
Nonterminal symbols are: S, stop
Terminal symbols are: A B C D E -
Productions:
S ::= E { - E } B stop
S ::= [ stop ] A stop
stop ::= C | D
We use this grammar to generate four Nonsense sentences:
S → A stop → A D
S → stop A stop → D A D
S → E B stop → E B C
S → E - E - E B stop → E - E - E B D
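The loop and the optional element in the Nonsense grammar translate naturally into control flow: a `{ ... }` repetition becomes a `while` loop and a `[ ... ]` option becomes an `if`. The recognizer below is a sketch under the assumption that sentences arrive as space-separated tokens with a plain hyphen for the dash terminal. The function names are hypothetical.

```python
def parse_stop(tokens, pos):
    """stop ::= C | D.  Return the position after the match, or None."""
    if pos < len(tokens) and tokens[pos] in ("C", "D"):
        return pos + 1
    return None


def parse_s(tokens, pos=0):
    """S ::= E { - E } B stop  |  [ stop ] A stop."""
    if pos < len(tokens) and tokens[pos] == "E":
        pos += 1
        # { - E } : zero or more repetitions become a while loop.
        while pos < len(tokens) and tokens[pos] == "-":
            pos += 1
            if pos >= len(tokens) or tokens[pos] != "E":
                return None
            pos += 1
        if pos < len(tokens) and tokens[pos] == "B":
            return parse_stop(tokens, pos + 1)
        return None
    # [ stop ] : an optional element becomes an if.
    after_stop = parse_stop(tokens, pos)
    if after_stop is not None:
        pos = after_stop
    if pos < len(tokens) and tokens[pos] == "A":
        return parse_stop(tokens, pos + 1)
    return None


def is_nonsense(sentence):
    """Return True if the whole sentence is derivable from S."""
    tokens = sentence.split()
    return parse_s(tokens) == len(tokens)


for s in ("A D", "D A D", "E B C", "E - E - E B D", "E - B C"):
    print(s, is_nonsense(s))
```

One token of look-ahead is enough here: the first token is E for the first production and C, D, or A for the second.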
Context-free Languages Syntax Diagrams
Syntax Diagrams
An alternative formal definition metalanguage was developed for Pascal; it is often called “railroad diagrams”. It has the same elements as EBNF, but they are presented in a 2D graphic format:
Terminal symbols are boldface and enclosed in ovals. Nonterminal symbols are written in non-bold type.
Production rules: the nonterminal being defined is written at the left, followed by an arrow.
Alternatives are shown by branches in the arrow.
To expand a nonterminal, follow some branch of the arrow to its end at the right.
An optional element is handled by an empty arrow branching around it.
Repetitions of a unit are shown by the arrow looping back on itself.
AS in Syntax Diagrams
We have an alternative and a recursive rule.
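[Figure: syntax diagrams for AS. S leads to A; A branches to x, to ( S ), or to A S A.]
Example derivations shown with the diagrams:
S → A → x
S → A → (S) → (A) → (x)
S → A → ASA → AAx → (S)(S)x → (A)(A)x → (x)((S))x → (x)((A))x → (x)((x))x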
Nonsense in Syntax Diagrams
Here is a looping rule and an optional element.
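[Figure: syntax diagrams for Nonsense. S either follows E, with a loop back through -, then B and stop, or follows an optional stop, then A and stop; stop branches to C or D.]
Example derivations shown with the diagrams:
S → A stop → A D
S → E B stop → E B C
S → stop A stop → C A D
S → E-E-E B stop → E-E-E B D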
The Definition of Pascal
Pascal Syntax
Here are large parts of the definition of Pascal. Productions involving type declarations have been omitted.
EBNF definition of a Pascal program
Syntax Diagrams for Pascal expressions
The Syntax for part of Pascal.
program ::= <program-heading> ; <program-block> ‘.’ .
program-heading ::= program <identifier> [ ( <program-parameters> ) ] .
program-parameters ::= <identifier-list> .
identifier-list ::= <identifier> { , <identifier> } .
program-block ::= <block> .
block ::= <label-declaration-part> <constant-declaration-part> <type-declaration-part> <variable-declaration-part> <procedure-and-function-declaration-part> <statement-part> .
variable-declaration-part ::= [ var { <identifier-list> : <typename> ; } ] .
Continuing with Pascal.
statement-part ::= <compound-statement> .
compound-statement ::= begin <statement-sequence> end .
statement-sequence ::= <statement> { ; <statement> } .
statement ::= [ <label> : ] ( <simple-statement> | <structured-statement> ) .
simple-statement ::= <empty-statement> | <assignment-statement> | <procedure-call-statement> | <goto-statement> .
structured-statement ::= <compound-statement> | <conditional-statement> | <repetitive-statement> | <with-statement> .
Simple Statements in Pascal.
empty-statement ::= .
assignment-statement ::= ( <variable-reference> | <function-name> ) ‘:=’ <expression> .
procedure-call-statement ::= <IO-procedure-statement> | <procedure-identifier> [ ( <actual-parameter-list> ) ] .
IO-procedure-statement ::= read <read-parameter-list> | readln <readln-parameter-list> | write <write-parameter-list> | writeln <writeln-parameter-list> .
goto-statement ::= goto <label> .
label-declaration ::= [ label <label> { , <label> } ] .
label ::= <digit-sequence> .
Conditionals in Pascal.
compound-statement ::= begin <statement> { ; <statement> } end .
conditional-statement ::= <if-statement> | <case-statement> .
if-statement ::= if <boolean-expression> then <statement> [ <else-part> ] .
else-part ::= else <statement> .
case-statement ::= case <case-index> of <case-list-element> { ; <case-list-element> } [ ; ] end .
case-list-element ::= <case-constant-list> : <statement> .
case-constant-list ::= <case-constant> { , <case-constant> } .
case-constant ::= <constant> .
Loops and With in Pascal.
repetitive-statement ::= <repeat-statement> | <while-statement> | <for-statement> .
repeat-statement ::= repeat <statement-sequence> until <boolean-expression> .
while-statement ::= while <boolean-expression> do <statement> .
for-statement ::= for <control-variable> ‘:=’ <initial-value> ( to | downto ) <final-value> do <statement> .
with-statement ::= with <record-variable-list> do <statement> .
Pascal Expressions
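[Figure: syntax diagrams for “simple expression” (an optional leading sign, then terms joined by adding operators such as + and or) and for “expression” (a simple expression, optionally followed by one of =, <, >, <=, >=, <>, in and a second simple expression).]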
Pascal Expressions
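[Figure: syntax diagrams for “term” (factors joined by *, /, div, mod, and) and for “function designator” (a function identifier, optionally followed by a parenthesized, comma-separated list of actual parameters).]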
Pascal Expressions
[Figure: syntax diagram for “factor”, with branches for unsigned constant, variable, function designator, set value, ( expression ), and not factor.]
Parsing
Ad-hoc Parsing
Parsing Based on EBNF
Old Languages were Parsed Ad-Hoc
These comments reflect FORTRAN-IV.
The language itself was created by collecting a lot of features. Everything about it was non-uniform and full of special cases. For example, there were half a dozen ways to punctuate a series of items. Syntax diagrams occupied 40 pages, versus 6 for Pascal.
Everything was made more difficult because the language definition said that spaces were ignored.
A FORTRAN-IV parser was basically hand-built. It would look at the next source-code character and try to figure out what it might be, given the current context.
This is a famous FORTRAN parsing problem that illustrates what is wrong with ad-hoc design: DO 200 I=1,10,2
Since DO200I is a legal variable name, we can’t know whether this is an assignment statement or a DO loop until the first comma token is found.
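The ambiguity can be illustrated with a deliberately simplified classifier. This is a sketch, not how any real FORTRAN compiler works: the function name `classify` is hypothetical, and the code ignores labels, string constants, and every other statement kind. It removes the blanks (since the language ignores them) and then has to scan past the "=" for a comma outside parentheses before it can decide.

```python
def classify(statement):
    """Guess whether a statement starting with DO is a loop or an assignment."""
    text = statement.replace(" ", "").upper()  # blanks carry no meaning
    if not text.startswith("DO") or "=" not in text:
        return "other"
    right_side = text.split("=", 1)[1]
    depth = 0
    for ch in right_side:
        if ch == "(":
            depth += 1
        elif ch == ")":
            depth -= 1
        elif ch == "," and depth == 0:
            return "do-loop"  # a top-level comma can only belong to a DO header
    return "assignment"  # no top-level comma: DO... was a variable name


print(classify("DO 200 I=1,10,2"))
print(classify("DO 200 I=1.10"))
```

The two test statements differ by a single character (a comma versus a period), yet the first is a loop header and the second assigns 1.10 to the variable DO200I.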
Ad-Hoc Languages Today
We can list several current languages with no rhyme or reason in the design:
The shells: bash, tcsh, and other UNIX shell languages and scripts.
Perl
TeX and LaTeX
These are hard to learn and hard to write correctly. They are parsed and interpreted in an ad-hoc manner. Often the semantics are complicated and hard to understand.
Parsing LL Parsers
Recursive Descent Parsing: LL(k) languages
A recursive descent parser is a top-down parser built from a set of mutually-recursive and/or non-recursive procedures.
Each procedure implements one of the rules of the grammar. Thus the structure of the parser closely mirrors that of the grammar it recognizes.
A linear-time parser can be built for any language in which a look-ahead of k input symbols allows the parser to decide which production to use next (k is a non-negative integer constant).
An ambiguous grammar cannot be parsed this way.
Also, the grammar cannot contain left-recursive rules, of the form expr ::= expr + term, because the procedure for expr would call itself again before consuming any input. However, right-recursive rules, of the form expr ::= term + expr, are not a problem.
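As an illustration of "one procedure per rule", here is a sketch of a recursive descent parser for rules 4 through 7 of the Pascal expression grammar given earlier in the lecture. The tokenizer, the class name `Parser`, and the nested-tuple tree format are assumptions made for this example; variableAccess is simplified to a bare identifier and unsignedConstant to an integer literal.

```python
import re

TOKEN = re.compile(r"\s*(\d+|[A-Za-z_]\w*|<=|>=|<>|[-+*/()<>=])")
REL_OPS = {"<", "<=", ">", ">=", "=", "<>"}
ADD_OPS = {"+", "-", "or", "xor"}
MUL_OPS = {"*", "/", "div", "mod", "and"}
KEYWORDS = {"or", "xor", "div", "mod", "and", "not"}


def tokenize(text):
    text, tokens, pos = text.rstrip(), [], 0
    while pos < len(text):
        m = TOKEN.match(text, pos)
        if not m:
            raise SyntaxError(f"unexpected character at position {pos}")
        tokens.append(m.group(1))
        pos = m.end()
    return tokens


class Parser:
    def __init__(self, tokens):
        self.tokens, self.pos = tokens, 0

    def peek(self):  # one token of look-ahead
        return self.tokens[self.pos] if self.pos < len(self.tokens) else None

    def advance(self):
        tok = self.peek()
        self.pos += 1
        return tok

    def expression(self):  # rule 4: simpleExpr { relationalOp simpleExpr }
        node = self.simple_expr()
        while self.peek() in REL_OPS:
            op = self.advance()
            node = (op, node, self.simple_expr())
        return node

    def simple_expr(self):  # rule 5: [ sign ] term { addOp term }
        sign = self.advance() if self.peek() in ("+", "-") else None
        node = self.term()
        if sign == "-":
            node = ("neg", node)
        while self.peek() in ADD_OPS:
            op = self.advance()
            node = (op, node, self.term())
        return node

    def term(self):  # rule 6: factor { multiplyOp factor }
        node = self.factor()
        while self.peek() in MUL_OPS:
            op = self.advance()
            node = (op, node, self.factor())
        return node

    def factor(self):  # rule 7: the recursive reference back to rule 4
        tok = self.advance()
        if tok is None:
            raise SyntaxError("unexpected end of input")
        if tok == "(":
            node = self.expression()
            if self.advance() != ")":
                raise SyntaxError("expected ')'")
            return node
        if tok == "not":
            return ("not", self.factor())
        if tok.isdigit() or (tok not in KEYWORDS and (tok[0].isalpha() or tok[0] == "_")):
            return tok
        raise SyntaxError(f"unexpected token {tok!r}")


def parse(text):
    parser = Parser(tokenize(text))
    tree = parser.expression()
    if parser.peek() is not None:
        raise SyntaxError(f"unexpected token {parser.peek()!r}")
    return tree


print(parse("3 * (x + 2) < limit"))
```

Each EBNF repetition is a `while` loop rather than a recursive call, which is how this grammar avoids left recursion while still producing left-associative trees.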
Top-down Parsing
The recursive descent parser starts with the starting symbol of the grammar and the beginning of the tokenized source-code file.
It then attempts to find a match for the left end of one of the possible definitions of the starting symbol.
If the left end is found, it calls itself recursively, with the rest of the source code, to find a match for the next part of the production.
This process works its way through the source code and down the list of productions. It will terminate successfully when the inner, recursive calls have all terminated and a match is found for the rightmost element in the original starting production.
If it fails at any point, it has recognized a syntax error.
Recursive Descent Parsing with Backtracking
A less-efficient top-down method exists for grammars that do not meet the criteria above.
The parser works as above, but if it fails at any point, it will backtrack and try another option from the current production.
This process will terminate when it succeeds or when all possibilities have been attempted.
Parsers that use recursive descent with backtracking may require exponential time.
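A compact way to sketch backtracking is with generators: each call yields every input position at which a symbol could end, and the caller tries them one after another. The grammar below (odd-length palindromes over a and b) is a hypothetical example chosen because no fixed look-ahead can locate the middle of the string. The function names are likewise assumptions. The grammar has no left recursion, so the search always terminates.

```python
# Hypothetical grammar: S ::= a S a | b S b | a | b .
GRAMMAR = {"S": [("a", "S", "a"), ("b", "S", "b"), ("a",), ("b",)]}


def match(symbol, text, pos, grammar):
    """Yield every position where symbol can end if it starts at pos."""
    if symbol not in grammar:  # a terminal: match it literally
        if text.startswith(symbol, pos):
            yield pos + len(symbol)
        return
    for alternative in grammar[symbol]:  # try each option in turn
        yield from match_sequence(alternative, text, pos, grammar)


def match_sequence(symbols, text, pos, grammar):
    """Yield every end position for a sequence of symbols starting at pos."""
    if not symbols:
        yield pos
        return
    for middle in match(symbols[0], text, pos, grammar):
        # If the rest fails from this position, the loop backtracks to the next.
        yield from match_sequence(symbols[1:], text, middle, grammar)


def accepts(text, grammar=GRAMMAR, start="S"):
    return any(end == len(text) for end in match(start, text, 0, grammar))


for s in ("aba", "abbba", "ab"):
    print(s, accepts(s))
```

Because every alternative may be retried at every position, the amount of work can grow very quickly with input length, which is the cost the slide warns about.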
Parsing LR Parsers
LR Parsers are bottom up
An LR(k) parser analyzes the source code from left to right with a look-ahead of k input tokens.
An LR parser starts with the leaves of the parse tree (the tokens) and attempts to build up from there to the starting symbol.
It detects a syntactic error when the input does not conform to the grammar.
The syntax of many programming languages can be defined by a grammar that is LR(1), or close to being so, and for this reason LR parsers are often used in compilers.
LR parsers are difficult to produce by hand and they are usually constructed by a parser generator, also called a compiler-compiler.
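The bottom-up idea can be sketched with a hand-coded shift-reduce loop for a tiny, hypothetical grammar, E ::= E + n | n. This is not a table-driven LR parser of the kind a generator produces; the reduction decisions are hard-wired for this one grammar, and the function name `shift_reduce` is an assumption. It does show tokens being shifted onto a stack and replaced by nonterminals until only the starting symbol remains.

```python
def shift_reduce(tokens):
    """Recognize E ::= E + n | n bottom-up; return (accepted, trace)."""
    stack, trace = [], []
    for tok in tokens:
        stack.append(tok)  # shift the next token onto the stack
        trace.append(("shift", tok))
        if stack[-3:] == ["E", "+", "n"]:
            stack[-3:] = ["E"]  # reduce the handle E + n to E
            trace.append(("reduce", "E ::= E + n"))
        elif stack == ["n"]:
            stack[:] = ["E"]  # reduce the very first n to E
            trace.append(("reduce", "E ::= n"))
    return stack == ["E"], trace


accepted, steps = shift_reduce(["n", "+", "n", "+", "n"])
print(accepted)
for step in steps:
    print(step)
```

Note that the grammar is left-recursive, which a recursive descent parser cannot handle directly but a bottom-up parser can.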
Compiler Compilers
A compiler compiler is a program whose input is a description of the language and whose output is a compiler. Yacc is a well-known Unix compiler compiler. The GNU version is called Bison. The inputs are:
A formal definition of the language’s lexical structure (expressed in EBNF).
A formal definition of preprocessing directives, if any, and their corresponding actions.
The EBNF definition of the language syntax, given that tokens have already been identified.
The code to be generated for each fully-parsed nonterminal symbol in the grammar.
The compiler compiler produces the compiler that will build a parse tree (front end) and transmute the tree into the corresponding object code (back end).
Homework
Homework 5: 12 points
1. (2) Generate a legal string of Nonsense that is longer than 10 terminals. Refer to the grammar on pages 13 and 18.
2. (2) Write an EBNF rule that defines a FORTH infinite loop (begin...repeat). This is a loop with no while inside it. Look up the precise definition in the FORTH reference spreadsheet. If you cannot find it, ask me. Invent any nonterminal symbols that you like, but use <word> and <words> to represent one or more symbols.
3. (4) Look up the syntax for the FORTH counted loop and draw a syntax diagram for it. Diagram both forms, one where you add to the loop variable on each iteration and the other where you subtract from it. The termination conditions are different and slightly tricky.
4. (4) Write a FORTH function that uses an if statement and a counted loop. Take one parameter (a number) off the stack. If it is less than 3, print an error comment. Otherwise, print the word “hooray” that many times. Turn in the code and the results by using cut-and-paste. Please no screen shots of a black-background screen.