compiler construction - bgucomp191/wiki.files/compiler... · 2018-10-31 · refresher last week, we...
TRANSCRIPT
Compiler Construction
Mayer Goldberg \ Ben-Gurion University
October 31, 2018
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 1 / 175
Chapter 2Goals
▶ The pipeline of the compiler▶ Introduction to syntactic analysis▶ Further steps in ocaml
Agenda▶ The pipeline
▶ Syntactic analysis▶ Semantic analysis▶ Code generation
▶ The compiler for the course▶ The language of S-expressions▶ More ocaml
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 2 / 175
Refresher
Last week, we discussed▶ The interpreter as an evaluation function▶ The compiler as a translator & optimizer▶ We explored the relations between interpretation & compilation
This was a rather high-level view of the areaWe now wish to consider compilation as a large software-project
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 3 / 175
Compilation as translation
A compiler translates between languages:▶ Understanding the syntax of the program
▶ What kinds of statements & expressions there are▶ What are the various parts of these statements & expressions▶ Are they syntactically correct
▶ Understanding the meaning of the program▶ Do the operations make sense?
▶ What are their types?▶ Are they used in accordance with their types?
▶ On what data is the program acting?▶ What is returned?
▶ Once we understand the syntax and meaning of a sentence, wecan render it in another language
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 4 / 175
The pipeline of the compiler
Since the 1950’s, the standard architecture for compilers has been apipeline:
▶ Syntactic analysis▶ Scanning▶ Parsing
▶ Reading▶ Tag-Parsing
▶ Semantic analysis▶ Code generation
Scanner Reader Tag-Parser Semantic Analyser
Code Generator
chars tokens sexprs ASTs ASTsasm /
mach lang
Parser
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 5 / 175
The pipeline of the compiler
The stages in the compiler pipeline are distinguished by▶ Function: What they do▶ Dependencies: Which stages depend on which other▶ Complexity: How difficult it is to perform a stage
In programming languages:▶ Understanding syntax is relatively straightforward (unlike in
natural language)▶ Understanding meaning is way harder than understanding syntax▶ Meaning is built upon syntax (in natural languages, syntax &
meaning can be inter-dependent)▶ Code generation is relatively straightforward (template-based)
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 6 / 175
The pipeline of the compiler
OptimizationsHow optimizations fit into the pipeline of the compiler:
▶ We distinguish [at least] two levels of optimizations:▶ High-level optimizations (closer to the source language) would
go into the semantic analysis phase▶ Low-level optimizations (closer to assembly language) would go
into the code generation phaseThis distinction can be fuzzy. Some make it fuzzier withintermediate-level optimizations
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 7 / 175
An example of a high-level optimization
Suppose the compiler can know that the value of n is 0 whenreaching the following statement:
if (n == 0) {foo();
}else {goo(n);
}
Then an obvious optimization to perform would be to eliminate theif-statement with:
foo();
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 8 / 175
How has the code improved:
Beforeif (n == 0) {foo();
}else {goo(n);
}
Afterfoo();
What was gained▶ The test during run-time has been eliminated▶ The code is shorter▶ Possibly lead to further, cascading optimizations
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 9 / 175
An example of a low-level optimization
Before:
mov rax, 1mov rax, 2
After:
mov rax, 2
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 10 / 175
How has the code improved:
Beforemov rax, 1mov rax, 2
Aftermov rax, 2
What was gained▶ Saved 1 cycle▶ Made the code smaller▶ If this code appears within a loop, gains shall be multiplied…
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 11 / 175
The pipeline of the compiler
Basic concepts▶ Concrete syntax▶ Abstract syntax▶ Abstract Syntax-Tree (AST)▶ Token▶ Delimiter▶ Whitespace
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 12 / 175
Concrete syntax (continued)The concrete syntax of a computer program is a textual stream ofcharacters:
▶ It’s one-dimensional▶ Lacking in structure
▶ No nesting▶ No sub-expressions
▶ Difficult to work with▶ Difficult to access parts▶ Difficult to determine correctness
▶ Contains redundancies (spaces, comments, etc)(define fact (lambda (n) (if (zero? n) 1(* n (fact (- n1))))))
Think of▶ A text file▶ Characters typed at the prompt
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 13 / 175
The pipeline of the compiler
Basic concepts🗸 Concrete syntax▶ Abstract syntax▶ Abstract Syntax-Tree (AST)▶ Token▶ Delimiter▶ Whitespace
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 14 / 175
Abstract syntax
The abstract syntax of a computer program is a tree-likedata-structure. It is:
▶ Multi-dimensional▶ Conveys structure
▶ Nested▶ Recursive (following the inductive definition of the grammar)
▶ Easier to work with than the concrete syntax▶ Easier to access parts▶ Easier to verify correctness▶ Some syntactic correctness issues have already been decided
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 15 / 175
The pipeline of the compiler
Basic concepts🗸 Concrete syntax🗸 Abstract syntax▶ Abstract Syntax-Tree (AST)▶ Token▶ Delimiter▶ Whitespace
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 16 / 175
Abstract Syntax-Tree (AST)
Notice▶ The AST is a tree▶ No text▶ No parenthesis▶ No spaces, tabs, newlines▶ The structure is evident▶ Easy to find
sub-expressions▶ Easy to determine
correctness▶ Easier to analyze,
transform, and compile
The AST of fact
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 17 / 175
Concrete vs Abstract Syntax▶ Parsing: going from concrete syntax to abstract syntax▶ Parser: the tool that performs parsing
Concrete Syntax▶ Lacks structure▶ Prone to errors▶ Hard to delimit
sub-expressions▶ Inefficient to work with▶ Concrete Syntax can be
avoided▶ Visual languages▶ Structure/syntax editors
Abstract Syntax▶ Has structure▶ Many kinds of errors are
avoided▶ Sub-Expressions are readily
accessible▶ Efficient to work with
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 18 / 175
The pipeline of the compiler
Basic concepts🗸 Concrete syntax🗸 Abstract syntax🗸 Abstract Syntax-Tree (AST)▶ Token▶ Delimiter▶ Whitespace
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 19 / 175
Tokens
▶ The smallest, meaningful, lexical unit in a language▶ Described using regular expressions▶ Identified using DFA (a very simple model of computation)▶ Examples
▶ Numbers▶ [Non-nested] Strings▶ Names (variables, functions)▶ Punctuation
▶ Cannot handle nesting of any kind:▶ Parenthesized expressions▶ Nested comments▶ etc.
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 20 / 175
Tokens
▶ Scanning: going from characters into tokens▶ Scanner: the tool that performs scanning▶ Scanner generator: the tool that takes definitions for tokens,
using regular expressions (and callback functions), and returns ascanner
▶ Examples of scanner-generators: lex, flex
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 21 / 175
The pipeline of the compiler
Basic concepts🗸 Concrete syntax🗸 Abstract syntax🗸 Abstract Syntax-Tree (AST)🗸 Token▶ Delimiter▶ Whitespace
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 22 / 175
Delimiters
▶ Delimiters are characters that separate tokens▶ In most languages spaces, parentheses, commas, semicolons,
etc., are all delimiters▶ Some tokens must be separated by delimiters
▶ Two consecutive numbers, two consecutive symbols, etc.▶ Some tokens do not need to be separated by delimiters
▶ Two consecutive strings, an open parenthesis followed by anumber, etc.
▶ Delimiters are language-dependent
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 23 / 175
The pipeline of the compiler
Basic concepts🗸 Concrete syntax🗸 Abstract syntax🗸 Abstract Syntax-Tree (AST)🗸 Token🗸 Delimiter▶ Whitespace
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 24 / 175
Whitespace
▶ Whitespace refers to characters that▶ Have no graphical representation▶ Occur between tokens
▶ Spaces within strings are not whitespaces…▶ Serve no syntactic purpose other than as delimiters and for
indentation▶ Whitespace is language-dependent
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 25 / 175
Delimiters in various languages
C & SchemeSpaces, tab, newlines, carriage returns, form feeds are examples ofwhitespaces
JavaLiteral newline characters may not occur inside a literal string (mustuse \n). Otherwise, similar to C & Scheme.
PythonLeading tabs are not whitespaces because they have a clear syntacticfunction: They denote nesting level.
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 26 / 175
Concrete vs Abstract syntax
Artifacts of the Concrete Syntax▶ Delimiters & whitespaces▶ Parentheses, brackets, braces, and other grouping and nesting
mechanisms (e.g., begin...end)Re-examine the concrete and abstract syntax for the factorialfunction, and notice what’s gone!
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 27 / 175
Concrete vs Abstract syntax (continued)
The concrete syntax(define fact (lambda (n)(if (zero? n) 1(* n (fact (- n1))))))
The abstract syntax
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 28 / 175
The pipeline of the compiler
Basic concepts🗸 Concrete syntax🗸 Abstract syntax🗸 Abstract Syntax-Tree (AST)🗸 Token🗸 Delimiter🗸 Whitespace
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 29 / 175
More on parsing
To parse computer programs in a given language, we rely on:▶ Grammars with which to express the syntax of the language
▶ There are different kinds of grammars (CFG, CSG, two-level,etc)
▶ There are different languages for expressing syntax in agrammar (e.g., BNF)
▶ Algorithms for parsing programs as per kind of grammar▶ Techniques (e.g., parsing combinators, DCGs)
Parser generator: Takes a description of the grammar for a language,and generates a parser. For example, yacc, bison, nearly, etc.
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 30 / 175
The pipeline of the compiler (continued)
Scanning▶ Going from characters to tokens▶ Identifying & grouping characters into tokens for words,
numbers, strings, etc.▶ Parsing over tokens is more efficient than parsing over
characters☞ As the parser examines various ways to parse the code, the
parser can avoid re-identifying and re-building complex tokenssuch as numbers, strings, etc
Scanner Reader Tag-Parser Semantic Analyser
Code Generator
chars tokens sexprs ASTs ASTsasm /
mach lang
Parser
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 31 / 175
The pipeline of the compiler (continued)Reading
▶ In LISP/Prolog, the parser is split into two components:▶ The reader, or the parser for the data language▶ The tag-parser, or the parser for the source code
▶ In LISP/Scheme/Racket/Clojure/etc, the abstract syntax forthe data is the concrete syntax for the code
▶ In Prolog, the abstract syntax for the data is the abstract syntaxfor the code
▶ Prolog is the programming language with the most powerfulcapabilities of reflection, i.e., code examining and working withitself.
Scanner Reader Tag-Parser Semantic Analyser
Code Generator
chars tokens sexprs ASTs ASTsasm /
mach lang
Parser
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 32 / 175
The pipeline of the compiler (continued)Reading — Summary
▶ In programming languages in which the syntax of code is not apart of the syntax of data, concrete syntax is given as a streamof characters
▶ In programming languages in which the syntax of code is part ofthe syntax of data, things are a bit more complex:
▶ The concrete syntax of data is a stream of characters▶ The concrete language of code is the abstract syntax of the
data▶ In Scheme, the language of data is called S-expressions (more on
this, later)
Scanner Reader Tag-Parser Semantic Analyser
Code Generator
chars tokens sexprs ASTs ASTsasm /
mach lang
Parser
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 33 / 175
The pipeline of the compiler (continued)
Tag-Parsing▶ The tag-parser takes sexprs and returns [ASTs for] exprs▶ Languages other than from the LISP & Prolog families do not
split parsing into a reader & tag-parser▶ In such languages, parsing goes directly from tokens to [ASTs
for] expressions
☞ Every valid program ”used to be” [i.e., before tag-parsing] avalid sexpr
☞ Not every valid sexpr is a valid program!
Scanner Reader Tag-Parser Semantic Analyser
Code Generator
chars tokens sexprs ASTs ASTsasm /
mach lang
Parser
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 34 / 175
The pipeline of the compiler (continued)
QuestionA parser should:👎 Perform optimizations👎 Evaluate expressions👎 Raise type-mismatch errors👎 Find potential runtime errors (null-pointer dereferences,
array-index errors, etc.)👍 Validate the structure of input programs against a syntactic
specification
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 35 / 175
The pipeline of the compiler (continued)
QuestionUsing an AST, it is impossible to:👎 Perform code reformatting/beautification/style-checking👎 Perform optimizations👎 Output a new program which is semantically equivalent to the
input program (code generation)👎 Refactor the input program👍 Generate a list of all the comments in the code
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 36 / 175
The pipeline of the compiler (continued)
Semantic Analysis▶ Annotate the ASTs▶ Compute addresses▶ Annotate tail-calls▶ Type-check code▶ Perform optimizations
Scanner Reader Tag-Parser SemanticAnalyser
Code Generator
chars tokens sexprs ASTs ASTsasm /
mach lang
Parser
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 37 / 175
The pipeline of the compiler (continued)
Code Generation▶ Generate a stream of instructions in
▶ assembly language▶ machine language
▶ Build executable▶ some other target language…
▶ Perform low-level optimizations
Scanner Reader Tag-Parser Semantic Analyser
CodeGenerator
chars tokens sexprs ASTs ASTsasm /
mach lang
Parser
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 38 / 175
The compiler for the course
Our compiler project▶ Written in ocaml▶ Support a subset of Scheme + extensions▶ Support two, simple optimizations▶ Compile to x86/64▶ Run on linux
What our project shall lack▶ Support for the full language of Scheme▶ Support for garbage collection▶ Self-compilation
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 39 / 175
S-expressions
▶ We’re going to learn about syntax by studying the syntax ofScheme
▶ After all, we’re writing a Scheme compiler…▶ It’s relatively simple, compared to the syntax of C, Java,
Python, and many other languages▶ It comes with some interesting twists
▶ Scheme comes with two languages:▶ A language for code▶ A language for data
and there’s a tricky relationship between the two.▶ The key to understanding the syntax of Scheme, is to think
about data
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 40 / 175
The Language of Data
What is a language of data? — A language in which to▶ Describe arbitrarily-complex data
▶ Possibly multi-dimensional, deeply nested▶ Polymorphic
▶ Access components easily and efficiently
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 41 / 175
The Language of Data (continued)
Today many languages of data are known:▶ S-expressions (the first: 1959)▶ Functors (1972)▶ Datalog (1977)▶ SGML (1986)▶ MS DDE (1987)▶ CORBA (1991)▶ MS COM (1993)▶ MS DCOM (1996)▶ XML (1996)▶ JSON (2001)
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 42 / 175
The Language of Data (continued)
What makes S-expressions and Functors unique?▶ They’re the first… 😉▶ They’re supported natively, as part of specific programming
languages▶ S-expressions are supported by LISP-based languages, including
Scheme & Racket▶ Functors are supported by Prolog-based languages
▶ The language of programming is a [strict] subset of the languageof data
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 43 / 175
The Language of Data (continued)Think for a moment about the language of XML:<something>...</something>, etc
▶ It’s not supported natively by any programming language▶ Most modern languages (Java, Python, etc) support it via
libraries▶ No programming language is written in XML:<package name="Foo"><class name="Foo"><method name="goo">
...</method>
</class></package>This would be cumbersome, and weird!
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 44 / 175
The Language of Data (continued)
However, if some programming language both▶ Supported XML as its data language▶ Were itself written in XML
Then a parser for XML could also read programs written in thatlanguage:
▶ Writing interpreters, compilers, and other language-tools wouldhave been much simpler!
▶ Reflection (code examining code) would be simple
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 45 / 175
The Language of Data (continued)
This is the case with S-expressions:▶ They are the data language for LISP-based languages, including
Scheme▶ LISP-based languages are written using S-expressions▶ Writing interpreters and compilers in LISP-based languages is
much simpler than in other languages▶ Computational reflection was invented in LISP!▶ This is the real reason behind all these parentheses in Scheme:
▶ A very simple language▶ Supports core types: pairs, vectors, symbols, strings, numbers,
booleans, the empty list, etc.▶ A syntactic compromise that is great for expressing both code
and data
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 46 / 175
S-expressions (continued)
Back to S-expressions▶ S-expressions were invented along with LISP, in 1959▶ S-expressions stand for Symbolic Expressions▶ The term is intended to distinguish itself from numerical
expressions▶ Before LISP (and long after it was invented), most computation
concerned itself with numbers▶ Computers languages were great at ”crunching numbers”, but
working with non-numeric data types was difficult▶ String libraries were non-standard and uncommon▶ Polymorphic data was unheard of▶ Nested data structured needed to be implemented from scratch,
usually with arrays of characters and/or integers…
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 47 / 175
S-expressions (continued)
Back to S-expressionsThen S-expressions were invented as part of a very dynamicprogramming language (LISP):
▶ Working with data structures became considerably simpler▶ Trivially allocated (no pointer arithmetic)▶ Polymorphic (lists of lists of numbers and strings and vectors of
booleans and…)▶ Easy to access sub-structures (no pointer arithmetic)▶ Easy to modify (in an easygoing, functional style)▶ Easy to redefine▶ Automatically deallocated (garbage collection)
▶ Treating code as data became considerably simpler
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 48 / 175
S-expressions (continued)
Several fields were invented using LISP and its tools:▶ Symbolic Mathematics (Macsyma, a precursor to Wolfram
Mathematica)▶ Artificial Intelligence▶ Computer adventure game generation languages (MDL, ZIL)
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 49 / 175
S-expressions (continued)
Definition: S-expressionsThe language is made up of
▶ The empty list: ()▶ Booleans: #f, #t▶ Characters: #\a, #\Z, #\space, #\return, #\x05d0, etc▶ Strings: "abc", "Hello\nWorld\t\x05d0;hi!", etc▶ Numbers: -23, #x41, 2/3, 2-3i, 2.34, -2.34+3.5i▶ Symbols: abc, lambda, define, fact, list->string▶ Pairs: (a . b), (a b c), (a (2 . #f) "moshe")▶ Vectors: #(), #(a b ((1 . 2) #f) "moshe")
Traditionally, non-pairs are known as atoms.
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 50 / 175
S-expressions (continued)
Proper & improper lists▶ The name LISP comes from LISt Processing.▶ In fact, LISP has no direct support for lists.▶ LISP has ordered pairs
▶ Ordered pairs are created using cons▶ The first and second projections over ordered pairs are car and
cdr. For all x, y:▶ (car (cons x y)) ≡ x▶ (cdr (cons x y)) ≡ y▶ The ordered pair of x and y can be written as (x . y)
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 51 / 175
S-expressions (continued)
The dot rulesTwo rules govern how ordered pairs are printed:
▶ Rule 1: For any E, the ordered pair (E . ()) is printed as (E),which looks like a list of 1 element.
▶ Rule 2: For any E1, E2, …, the ordered pair (E1 . (E2 — )) isprinted as (E1 E2 — )
▶ These rules just effect how pairs are printed▶ These rules give us a canonical representation for pairs
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 52 / 175
S-expressions (continued)
Example▶ The pair (a . (b . c)) is printed as (a b . c)
SYMBOL
a
SYMBOL
b
SYMBOL
c
PAIR
CAR CDR
PAIR
CAR CDR
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 53 / 175
S-expressions (continued)Example
▶ The pair ((a . (b . ())) . ((c . (d . ())))) isprinted as ((a b) (c d))
SYMBOL
a
SYMBOL
bNIL
PAIR
CAR CDR
PAIR
CAR CDR
SYMBOL
c
SYMBOL
dNIL
PAIR
CAR CDR
PAIR
CAR CDRNIL
PAIR
CAR CDR
PAIR
CAR CDR
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 54 / 175
S-expressions (continued)
▶ Lists in Scheme can come in two forms, proper lists andimproper lists.
▶ When we just speak of lists, we usually mean proper lists.▶ Most of the list processing functions (length, map, etc) all work
with proper lists.
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 55 / 175
S-expressions (continued)
Proper lists▶ Proper lists are nested ordered pairs the rightmost cdr of which
is the empty list (aka nil)▶ Testings for pairs is cheap, and is done by means of the builtin
predicate pair?▶ Testing for lists is expensive, since it traverses nested, ordered
pairs, until it reaches their rightmost cdr. This is done bymeans of the builtin predicate list?
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 56 / 175
S-expressions (continued)
Proper listsHere’s a definition for list?:(define list?(lambda (e)
(or (null? e)(and (pair? e)
(list? (cdr e))))))
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 57 / 175
S-expressions (continued)
Improper lists▶ Pairs that are not proper lists are improper lists.▶ Improper lists end with a rightmost cdr that is not nil▶ List-processing procedures such as length, map, etc., do not
work over improper lists▶ There is no builtin procedure for testing improper lists, but it
could be written as follows:
(define improper-list?(lambda (e)
(and (pair? e)(not (list? (cdr e))))))
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 58 / 175
S-expressions (continued)
Self-evaluating formsBooleans, numbers, characters, strings are self-evaluating forms. Youcan evaluate them directly at the prompt:
> 123123> "abc""abc"> #t#t> #\m#\m
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 59 / 175
S-expressions (continued)
Other formsThe empty list, pairs, and vectors cannot be evaluated directly at theprompt:
▶ Entering an empty list or a vector or an improper list at theprompt generates a run-time error.
▶ Entering a symbol at the prompt causes Scheme to attempt toevaluate a variable by the same name
▶ Entering a proper list, that is not the empty list, at the promptcauses Scheme to attempt to evaluate an application:> (a b c)
Exception: variable b is not boundType (debug) to enter the debugger.
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 60 / 175
S-expressions: quote & friends
To evaluate S-expressions that are not self-evaluating, we must usethe form quote:
▶ The special form quote can be written in two ways:▶ (quote <sexpr>)▶ '<sexpr>
Both forms are equivalent▶ When you type abc at the Scheme prompt, you’re evaluating
the variable abc▶ When you type 'abc at the Scheme prompt, you’re evaluating
the literal symbol abc▶ The value of the literal symbol abc is just itself, which is why
when you type 'abc at the Scheme prompt, you get back abc
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 61 / 175
S-expressions: quote & friends
▶ When you type () at the Scheme prompt, you’re evaluating anapplication with no function and no arguments! This is asyntax-error!
▶ When you type '() at the Scheme prompt, you’re evaluating aliteral empty list
▶ The value of the literal empty list is just itself, which is whywhen you type '() at the Scheme prompt, you get back ()
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 62 / 175
S-expressions: quote & friends▶ When you type (a b c) at the Scheme prompt, you’re
evaluating the application of the procedure a to the arguments band c, which are variables
▶ When you type '(a b c) at the Scheme prompt, you’reevaluating the literal list (a b c)
▶ The value of the literal list (a b c) is just (a b c), which iswhy when you type '(a b c) at the Scheme prompt, you getback (a b c).
▶ Quoting a self-evaluating S-expression is possible, andredundant:> '22> (+ '2 '3)5
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 63 / 175
S-expressions: quote & friends
So what does quote do?▶ The quote form does nothing
▶ It is not a procedure▶ It doesn’t take an argument▶ It delimits a constant, literal S-expressions
▶ The syntactic function of quote in Scheme is the same as thesyntactic function of braces { ... } in C in defining literaldata:const int A[] = {4, 9, 6, 3, 5, 1};
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 64 / 175
S-expressions: quote & friends
Meet quasiquote▶ Simlarly to quote, the form quasiquote can be written in two
ways:▶ (quasiquote <sexpr>)▶ `<sexpr>
▶ quasiquote is also used to define data:▶ `abc is the same as 'abc▶ `(a b c) is the same as '(a b c)
▶ But quasiquote has two neat tricks!
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 65 / 175
S-expressions: quote & friends
Meet quasiquote▶ The following two forms may occur within aquasiquote-expression:
▶ The unquote form:▶ (unquote <sexpr>)▶ ,<sexpr>
▶ The unquote-splicing form:▶ (unquote-splicing <sexpr>)▶ ,@<sexpr>
▶ Both unquote & unquote-splicing are used to mix indynamic data into the data defined with quasiquote
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 66 / 175
S-expressions: quote & friends
Meet quasiquote> '(a (+ 1 2 3) b)(a (+ 1 2 3) b)> '(a ,(+ 1 2 3) b)(a ,(+ 1 2 3) b)> `(a (+ 1 2 3) b)(a (+ 1 2 3) b)> `(a ,(+ 1 2 3) b)(a 6 b)> `(a ,(append '(x y) '(z w)) b)(a (x y z w) b)> `(a ,@(append '(x y) '(z w)) b)(a x y z w b)
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 67 / 175
S-expressions: quote & friends
Meet quasiquote▶ The expression `(a ,(append '(x y) '(z w)) b) is
equivalent to (cons 'a (cons (append '(x y) '(z w))'(b)))
▶ The expression `(a ,@(append '(x y) '(z w)) b) isequivalent to (cons 'a (append (append '(x y) '(z w))'(b)))
▶ The difference between unquote & unquote-splicing is that▶ unquote mixes in an expression using cons▶ unquote-splicing mixes in an expression using append
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 68 / 175
S-expressions: quote & friends
Meet quasiquote▶ Together, quasiquote, unquote, & unquote-splicing are
known as the quasiquote mechanism or the backquotemechanism
▶ The quasiquote mechanism allows us to create data bytemplate, that is, by specifying the shape of the data
▶ In Scheme, convenient ways to create data translateimmediately into convenient ways to create code
▶ Therefore we expect the quasiquote mechanism to have usefulapplications within programming languages
▶ We can turn code that computes something into code thatshows us a computation…
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 69 / 175
S-expressions: quote & friends
Consider the familiar factorial function:(define fact
(lambda (n)(if (zero? n)
1(* n (fact (- n 1))))))
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 70 / 175
S-expressions: quote & friends
We use the quasiquote mechanism to convert the application (* n(fact (- n 1))) into code that describes what factorial does:(define fact
(lambda (n)(if (zero? n)
1`(* ,n ,(fact (- n 1))))))
Running (fact 5) now gives:
> (fact 5)(* 5 (* 4 (* 3 (* 2 (* 1 1)))))
As you can see, factorial now prints a trace of the computation.
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 71 / 175
S-expressions: quote & friends
We are now going to use the quasiquote mechanism to get Schemeto teach us about the structure of S-expressions.Consider the following code:(define foo
(lambda (e)(cond ((pair? e)
(cons (foo (car e))(foo (cdr e))))
((or (null? e) (symbol? e)) e)(else e))))
What does this program do?
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 72 / 175
S-expressions: quote & friends
Let’s call foo with some arguments:
> (foo 'a)a> (foo 123)123> (foo '())()> (foo '(a b c))(a b c)
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 73 / 175
S-expressions: quote & friendsLooking over the code again(define foo
(lambda (e)(cond ((pair? e)
(cons (foo (car e))(foo (cdr e))))
((or (null? e) (symbol? e)) e)(else e))))
we notice that:▶ The 2nd and 3rd ribs of the cond overlap [we could have
removed the 2nd]▶ All atoms are left unchanged▶ All pairs are duplicated, while recursing over the car and cdr of
the pairSo foo does nothing, though it does it recursively! ☺
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 74 / 175
S-expressions: quote & friends
We now use the quasiquote mechanism to cause foo to generate atrace:(define foo
(lambda (e)(cond ((pair? e)
`(cons ,(foo (car e)),(foo (cdr e))))
((or (null? e) (symbol? e)) `',e)(else e))))
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 75 / 175
S-expressions: quote & friendsRunning foo now gives us some interesting data:
> (foo 'a)'a> (foo '(a b c))(cons 'a (cons 'b (cons 'c '())))> (foo '(a 1 b 2))(cons 'a (cons 1 (cons 'b (cons 2 '()))))> (foo 123)123> (foo '((a b) (c d)))(cons(cons 'a (cons 'b '()))(cons (cons 'c (cons 'd '())) '()))
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 76 / 175
S-expressions: quote & friends
▶ Using the quasiquote mechanism, we got foo to describe howS-expressions are created using the most basic API.
▶ We should really add support for proper lists and vectors!▶ In fact, the name describe is far more appropriate than foo…
Let’s rewrite foo…
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 77 / 175
S-expressions: quote & friends
(define describe(lambda (e)
(cond ((list? e)`(list ,@(map describe e)))
((pair? e)`(cons ,(describe (car e))
,(describe (cdr e))))((vector? e)`(vector
,@(map describe(vector->list e))))
((or (null? e) (symbol? e)) `',e)(else e))))
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 78 / 175
S-expressions: quote & friends
Running describe on various S-expressions is very instructive:
> (describe '(a b c))(list 'a 'b 'c)> (describe '#(a b c))(vector 'a 'b 'c)> (describe '(a b . c))(cons 'a (cons 'b 'c))> (describe ''a)(list 'quote 'a)
Wait! What’s with the last example?!
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 79 / 175
S-expressions: quote & friends
Recall what we said about quote, quasiquote, unquote, &unquote-splicing:
▶ (quote <sexpr>) ≡ '<sexpr>▶ (quasiquote <sexpr>) ≡ `<sexpr>▶ (unquote <sexpr>) ≡ ,<sexpr>▶ (unquote-splicing <sexpr>) ≡ ,@<sexpr>
Now we get to see this happen…
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 80 / 175
S-expressions: quote & friends
Now we get to see this happen:
> (describe ''<sexpr>)(list 'quote '<sexpr>)> (describe '`<sexpr>)(list 'quasiquote '<sexpr>)> (describe ',<sexpr>)(list 'unquote '<sexpr>)> (describe ',@<sexpr>)(list 'unquote-splicing '<sexpr>)
Rule: Every Scheme expression used to be an S-expression when itwas little!
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 81 / 175
S-expressions: quote & friends
QuestionWhat is (length '''''''''''''''''moshe) ?👎 17👎 16👎 Generates an error message!👎 1👍 2
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 82 / 175
S-expressions: quote & friends
Explanation(length '''''''''''''''''moshe) is the same as (length'(quote <something>)), where <something> is'''''''''''''''moshe, but that really doesn’t matter! We are stillcomputing the length of a list of size 2:
▶ The first element of the list is the symbol quote▶ The second element of the list is '''''''''''''''moshe
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 83 / 175
S-expressions: quote & friends (continued)
QuestionThe structure of the S-expression ''a in Scheme is:👎 Just the symbol a👎 The proper list (quote . (a . ()))👎 The proper list (quote . (quote . (a . ())))👎 An invalid S-expression👍 The nested proper list (quote . ((quote . (a . ())) .
()))
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 84 / 175
Further reading
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 85 / 175
Chapter 2
Goals🗸 The pipeline of the compiler🗸 Introduction to syntactic analysis☞ Further steps in ocaml
Agenda☞ Ocaml
▶ Types▶ References▶ Modules & signatures▶ Functional programming in ocaml
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 86 / 175
Introduction to ocaml (2)
Still need to coverTo program in ocaml effectively in this course , we still need to learnsome additional topics:
▶ Defining new data types▶ Assignments, side-effects,
What we shan’t coverObject Orientation: Once you’re comfortable with the ocaml, youmight like to pick up the object-oriented layer. As object-orientationgoes, you should find it to be sophisticated and expressive.
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 87 / 175
Types
New types are defined using the type statement:
type fraction = {numerator : int; denominator : int};;
The above statement defines a new type fraction as a recordconsisting of two fields: numerator & denominator, both of typeint.
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 88 / 175
Types (continued)
Once fraction has been defined, the underlying system recognizesit for all records with these fields & types:
# {numerator = 2; denominator = 3};;- : fraction = {numerator = 2; denominator = 3}# {denominator = 3; numerator = 2};;- : fraction = {numerator = 2; denominator = 3}
Notice that the order of the fields in a record is immaterial, becausethe fields are accessed through their names, which are convertedconsistently into offsets.
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 89 / 175
Types (continued)The type-inference engine in ocaml will correctly infer newly-definedtypes:
let add_fractions f1 f2 =match f1, f2 with| {numerator = n1; denominator = d1},{numerator = n2; denominator = d2} ->{numerator = n1 * d2 + n2 * d1;denominator = d1 * d2};;
And of course:
# add_fractions{numerator = 2; denominator = 3}{numerator = 4; denominator = 5};;
- : fraction = {numerator = 22; denominator = 15}
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 90 / 175
Types (continued)
We can define disjoint types as follows:
type number =| Int of int| Frac of fraction| Float of float;;
Think of the | as disjunction. The initial | is optional in ocaml.
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 91 / 175
Types (continued)
We can now define a list of numbers as follows:
# [Int 3;Frac {numerator = 3;
denominator = 4};Float (4.0 *. atan(1.0))];;
- : number list =[Int 3;Frac {numerator = 3;
denominator = 4};Float 3.14159265358979312]
Notice that ocaml had no trouble identifying each of the threeelements of the list as belonging to type number.
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 92 / 175
Types (continued)
Working with disjoint typesUse match to dispatch over the corresponding type constructor, andmake sure you handle each and every possibility!
let number_to_string x =match x with| Int n -> Format.sprintf "%d" n| Frac {numerator = num; denominator = den} ->
Format.sprintf "%d/%d" num den| Float x -> Format.sprintf "%f" x;;
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 93 / 175
Types (continued)
Working with disjoint types (continued)And here’s how it looks:
# number_to_string (Int 234);;- : string = "234"# number_to_string (Frac {numerator = 2; denominator = 5});;- : string = "2/5"# number_to_string (Float 234.234);;- : string = "234.234000"
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 94 / 175
References
Let us take another look at the record-type. Recall the definition offraction:
# type fraction = {numerator : int; denominator : int};;type fraction = { numerator : int; denominator : int; }
In the function add_fractions we used pattern-matching to accessthe record-fields.
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 95 / 175
References (continued)
Ocaml lets you access fields directing, using the dot-notation that isfamiliar from object-oriented programming:
# {numerator = 3; denominator = 5}.numerator;;- : int = 3# {numerator = 3; denominator = 5}.denominator;;- : int = 5
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 96 / 175
References (continued)
Ocaml offers a special record-type known as a reference.▶ References are derived types. For any type α, we can have a
type α ref.▶ References are records with a single field contents▶ References have a special syntax ! to dereference the field:
# {contents = 1234};;- : int ref = {contents = 1234}# {contents = 1234}.contents;;- : int = 1234# ! {contents = 1234};;- : int = 1234
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 97 / 175
References (continued)▶ References have a special syntax := for assignment▶ This is how assignments are managed in ocaml
# let x = ref 1234;;val x : int ref = {contents = 1234}# x;;- : int ref = {contents = 1234}# !x;;- : int = 1234# x := 4567;;- : unit = ()# x;;- : int ref = {contents = 4567}# !x;;- : int = 4567
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 98 / 175
References (continued)
▶ It is not possible to perform assignments on variables▶ It is only possible to change the fields of reference types
# let x = "abc";;val x : string = "abc"# x := "def";;Characters 0-1:x := "def";;^
Error: This expression has type stringbut an expression was expected of type
'a ref
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 99 / 175
References (continued)
▶ You can define a reference type of any other type, includingother reference types:
# let x = ref (ref 1234);;val x : int ref ref = {contents = {contents = 1234}}# x := ref 5678;;- : unit = ()# x;;- : int ref ref = {contents = {contents = 5678}}# !x := 9876;;- : unit = ()# x;;- : int ref ref = {contents = {contents = 9876}}
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 100 / 175
Modules, signatures, functors
Modules▶ A module is a way of packaging functions, classes, variables, &
types▶ A signature is the type of a module
▶ Visibility of a module can be restricted through the signature▶ Functors are functions from functors/modules to
functors/modules
Goals▶ Learn to work with existing modules▶ Learn to write your own modules
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 101 / 175
Modules, signatures, functors (continued)We define the function hyp to compute the hypotenuse of a triangle,given two sides and the angle between them (cosine law). We usethe auxiliary function square:
# module M = structlet square x = x *. xlet hyp a b theta =sqrt((square a) +. (square b) -.
2.0 *. a *. b *. (cos theta))end;;
module M : sigval square : float -> floatval hyp : float -> float -> float -> float
end
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 102 / 175
Modules, signatures, functors (continued)
Both M.square and M.hyp are visible:
# M.hyp;;- : float -> float -> float -> float = <fun># M.square;;- : float -> float = <fun># M.square 2.0;;- : float = 4.# M.hyp 3.5 5.6 0.645771823239;;- : float = 3.50763282088818817
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 103 / 175
Modules, signatures, functors (continued)We define the module type based on the returned signature of M,but with the square function removed:# module type SigHyp = sig
val hyp : float -> float -> float -> floatend;;
module type SigHyp = sigval hyp : float -> float -> float -> float
end# module M : SigHyp = struct
let square x = x *. xlet hyp a b theta =sqrt((square a) +. (square b) -.
2.0 *. a *. b *. (cos theta))end;;
module M : SigHypMayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 104 / 175
Modules, signatures, functors (continued)Visibility is now restricted:
▶ M.hyp is visible from outside M▶ M.square is not visible from outside M▶ Functions visible from outside may use functions visible from
inside
# M.hyp;;- : float -> float -> float -> float = <fun># M.square;;Characters 0-8:M.square;;^^^^^^^^
Error: Unbound value M.square# M.hyp 3.5 5.6 0.645771823239;;- : float = 3.50763282088818817
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 105 / 175
Modules, signatures, functors (continued)
Summary▶ Modules & signatures are the way to package functions &
control visibility▶ Convenient, super-efficient, safe▶ No need to use local, nested functions to manage visibility▶ Always use signatures to control visibility!
Learn on your own▶ Modules can contain types too, and be used to parameterize
code with types▶ Simpler & better than generics & templates
▶ Functors map modules/functors =⇒ modules/functors
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 106 / 175
Further reading
🕮 The Objective Caml Programming Language, Chapter 12🔗 An online tutorial on ocaml modules
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 107 / 175
Parsing Techniques
Dozens of parsing algorithms are known:▶ Parsing algorithms are tailored to a specific kind of grammar
▶ Different kinds of grammars can be parsed by differentalgorithms
▶ Different kinds of grammars have different levels of complexityon the Chomsky Hierarchy
▶ Most programming languages can be described usingcontext-free grammars
▶ Some older languages can only be described usingcontext-sensitive grammars
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 108 / 175
Parsing Techniques (continued)
Context-free Grammars (CFGs)A CFG is a structure of the form G = ⟨V,Σ,R, S⟩:
▶ V is a set of non-terminals▶ Σ is a set of terminals, or tokens▶ R is a relation in V × (V ∪ Σ)∗
▶ Members of R are called production rules or rewrite rules▶ S is the an initial non-terminal
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 109 / 175
Parsing Techniques (continued)
Context-free Grammars (conveniences)▶ We abbreviate the two productions ⟨A,X⟩ , ⟨A,Y⟩ ∈ R with
⟨A,X | Y⟩ (disjunction)▶ We abbreviate the three productions ⟨A,X⟩ , ⟨X, ε⟩ , ⟨X,BX⟩ ∈ R,
where X has no other productions, with ⟨A,B∗⟩, (Kleene-star)▶ We abbreviate the three productions
⟨A,X⟩ , ⟨X,B⟩ , ⟨X,BX⟩ ∈ R, where X has no other productions,with ⟨A,B+⟩, (Kleene-plus)
▶ We abbreviate the two productions ⟨A, ε⟩ , ⟨A,B⟩ ∈ R, with⟨A,B?
⟩(maybe)
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 110 / 175
Parsing Techniques (continued)
The two basic approaches to parsing CFG are top-down & bottom-up:
Top-down algorithms▶ Start with the initial non-terminal▶ Rewrite the LHS of a non-terminal with its RHS, matching the
input stream of tokens▶ Keep rewriting until the entire input stream is matched
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 111 / 175
Parsing Techniques (continued)
The two basic approaches to parsing CFG are top-down & bottom-up:
Bottom-up algorithms▶ Start with the input stream of tokens▶ Find a rewrite rule where the RHS matches sequences in the
input, and rewrite them to the LHS, reducing several items to asingle non-terminal
▶ Keep rewriting until the entire input stream has been reduced tothe initial non-terminal
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 112 / 175
Parsing Techniques (continued)
How most parsing algorithms are used▶ Describe the grammar of the language using a DSL for some
restricted CFG▶ Example: Backus-Naur Form (BNF)
▶ Associate actions with each production rule:▶ How to build the AST when a specific rule is matched
▶ A parser generator (e.g., yacc, bison, antlr, etc) compiles thegrammar:
▶ Performing various optimizations▶ Generating code in some language (C, Java, ocaml, etc)▶ This code is the parser
▶ Calling the parser on some input returns an AST
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 113 / 175
Parsing Techniques (continued)
Goals of parsing algorithms▶ Minimal restrictions on the grammar▶ Avoid backtracking as much as possible▶ Maximum optimizations of the parser
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 114 / 175
Parsing Combinators
A technique for embedding a specification of a grammar into aprogramming language:
▶ Parsers for larger languages are composed from parsers forsmaller languages
▶ The grammar can be written & debugged bottom-up▶ The parsers are first-class objects:
▶ We get to use abstraction to create complex parsers quickly &simply
▶ Re-use effectively common sub-languages▶ Simple to understand & implement▶ Very rapid development
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 115 / 175
Parsers Combinators (continued)Parsing combinators do have some disadvantages:
▶ The grammar is embedded as-is:▶ As much backtracking as implied by the grammar: Rewrite
rules that have large common prefixes are going to requireplenty of backtracking:
A → xByCzDtA → xByCzDw
· · ·
▶ No optimizations or transformations are performed on it!▶ ε-productions & left-recursion result in infinite loops
▶ We need to eliminate these manually!▶ Can produce inefficient parsers rather efficiently! 😉
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 116 / 175
Parsers Combinators (continued)
Nevertheless:▶ Parsing combinators are very simple to learn about grammars:
▶ No complex algorithms are necessary!▶ The easiest way to design complex grammars & their parsers:
Abstraction —▶ shortens & simplifies the code▶ encourages re-use & consistency
▶ Optimizations can always be done manually!
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 117 / 175
Parsers Combinators (continued)
Our parsing combinators take lists of characters for input, and returnan AST. We start with code to convert strings to lists of characters:
let string_to_list str =let rec loop i limit =if i = limit then []else (String.get str i) :: (loop (i + 1) limit)
inloop 0 (String.length str);;
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 118 / 175
Parsers Combinators (continued)
We shall also want to generate a string from a list of characters:
let list_to_string s =let rec loop s n =match s with| [] -> String.make n '?'| car :: cdr ->
let result = loop cdr (n + 1) inBytes.set result n car;result
inloop s 0;;
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 119 / 175
Parsers Combinators (continued)
Sometimes our parsers must fail on their input. When this happens,we raise an exception (which in other languages is called throwing anexception).We should therefore define an exception:
exception X_no_match;;
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 120 / 175
Parsers Combinators (continued)
Parsing combinators are compositional. This means▶ We build parsers of large languages by combining parsers for
smaller [sub-]languages▶ The procedures that combine parsers are called parsing
combinators (PCs)▶ But we must start by being able to parse single characters
▶ All other parsers are built on top of such simple parsers forsingle characters
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 121 / 175
Parsers Combinators (continued)
The const PC takes a predicate (char -> bool), and return aparser that recognizes this character:
let const pred =function| [] -> raise X_no_match| e :: s ->
if (pred e) then (e, s)else raise X_no_match;;
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 122 / 175
Parsers Combinators (continued)
We define the non-terminal that recognizes the capital letter 'A' bycalling const with a predicate that returns true if its argument isequal to 'A':
# let ntA = const (fun ch -> ch = 'A');;val ntA : char list -> char * char list = <fun>
Notice that ntA▶ …takes a list of characters▶ …returns a pair of what it matched, and the remaining characters
This is the structure of all parsers written using PCs
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 123 / 175
Parsers Combinators (continued)
Using ntA# ntA ['A'; 'B'; 'C'];;- : char * char list = ('A', ['B'; 'C'])# ntA [];;Exception: PC.X_no_match.# ntA ['a'; 'A'];;Exception: PC.X_no_match.
▶ We only match the head of the input▶ Obviously, ntA fails on an empty list
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 124 / 175
Parsers Combinators (continued)
▶ Testing our parsers by applying them to lists is no fun▶ It’s a pain to type lists of characters!
▶ Let’s automate things a bit:let test_string nt str =let (e, s) = (nt (string_to_list str)) in(e, (Printf.sprintf "->[%s]" (list_to_string s)));;
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 125 / 175
Parsers Combinators (continued)
We can now test more easily:
# test_string ntA "";;Exception: PC.X_no_match.# test_string ntA "Abc";;- : char * string = ('A', "->[bc]")
This is only for testing! When we deploy our parser, we’ll call itdirectly.
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 126 / 175
Parsers Combinators (continued)
Constant parsers are not very useful! Let’s consider catenation:
let caten nt1 nt2 =fun s ->let (e1, s) = (nt1 s) inlet (e2, s) = (nt2 s) in((e1, e2), s);;
▶ We try to parse the head of s using nt1▶ If we succeed, we get e1 and the remaining chars s▶ We try to parse the head of s (what remained after nt1) using
nt2▶ If we succeed, we get e2 and the remaining chars s▶ We return the pair of e1 & e2, as well as the remaining chars
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 127 / 175
Parsers Combinators (continued)We define and test the parser for A followed by B:# let ntAB =caten (const (fun ch -> ch = 'A'))
(const (fun ch -> ch = 'B'));;val ntAB : char list -> (char * char) * char list = <fun># test_string ntAB "ABC";;- : (char * char) * string = (('A', 'B'), "->[C]")# test_string ntAB "abc";;Exception: PC.X_no_match.# test_string ntAB "Abc";;Exception: PC.X_no_match.# test_string ntAB "AB";;- : (char * char) * string = (('A', 'B'), "->[]")# test_string ntAB "A Bcdef";;Exception: PC.X_no_match.
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 128 / 175
Parsers Combinators (continued)
We now consider disjunction of two parsers:
let disj nt1 nt2 =fun s ->try (nt1 s)with X_no_match -> (nt2 s);;
▶ We try to parse the head of s using nt1▶ If we succeed, then the call to nt1 returns normally▶ If we fail we try to parse the head of s using nt2
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 129 / 175
Parsers Combinators (continued)We define and test the parser for either A or a:
# let ntA_or_a =disj (const (fun ch -> ch = 'A'))
(const (fun ch -> ch = 'a'));;val ntA_or_a : char list -> char * char list = <fun># test_string ntA_or_a "";;Exception: PC.X_no_match.# test_string ntA_or_a "this won't work either";;Exception: PC.X_no_match.# test_string ntA_or_a "A nice example";;- : char * string = ('A', "->[ nice example]")# test_string ntA_or_a "a nice example";;- : char * string = ('a', "->[ nice example]")
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 130 / 175
Parsers Combinators (continued)
What next?▶ Some simple parsers▶ Learn about the algebra of PCs▶ Learn of new PC operators▶ Learn how to use abstraction to make our life simpler
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 131 / 175
Some simple parsers
let nt_epsilon s = ([], s);;let nt_none _ = raise X_no_match;;let nt_end_of_input = function| [] -> ([], [])| _ -> raise X_no_match;;
▶ nt_epsilon is the parser that recognizes ε-productions▶ nt_none is the parser that always fails▶ nt_end_of_input is the parser that recognizes the end of the
input stream (and fails otherwise)
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 132 / 175
Parsers Combinators (continued)
What next?🗸 Some simple parsers▶ Learn about the algebra of PCs▶ Learn of new PC operators▶ Learn how to use abstraction to make our life simpler
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 133 / 175
The Algebra of PCsWhy do nt_epsilon & nt_end_of_input match with the emptylist []?This has to do with the Algebra of parsing combinators:
▶ What is the unit element of catenation?▶ Answer: r = ε▶ We’re looking for a non-terminal r such that for any s, we have
rs = sr = s…▶ This means that nt_epsilon is the unit element for caten:
▶ caten nt_epsilon nt ≡ caten nt nt_epsilon ≡ nt▶ Both nt_epsilon & nt_end_of_input are used ’til the end of
something▶ The natural operation is to create a list of all things until ε or
the end-of-input are reached▶ The unit element for append on lists is the empty list▶ Ergo, it is natural to match [] when either condition is
encountered
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 134 / 175
The Algebra of PCs (continued)
Similarly, nt_none is the unit element in the algebra of disjuction:disj nt nt_none ≡ disj nt_none nt ≡ nt
☞ Later on, we shall use the algebra of PCs together with foldingoperations to create complex parsers easily
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 135 / 175
Parsers Combinators (continued)
What next?🗸 Some simple parsers🗸 Learn about the algebra of PCs▶ Learn of new PC operators▶ Learn how to use abstraction to make our life simpler
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 136 / 175
New PC OperatorsIdentifying the characters, or pairs of characters, etc that match agrammar is often not enough:
▶ We want to be able to create an AST for that piece of syntax▶ We do this by specifying postprocessing or callback functions
over the expression that was matched.▶ In our package, the PC that performs this is called packlet pack nt f =fun s ->let (e, s) = (nt s) in((f e), s);;▶ pack takes a non-terminal nt and a function f
▶ returns a parser that recognizes the same language as nt▶ …but which applies f to whatever was matched
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 137 / 175
Parsing combinators (continued)
Example: Identifying digits# let nt_digit_0_to_9 =const (fun ch -> '0' <= ch && ch <= '9');;val nt_digit_0_to_9 :
char list -> char * char list = <fun># test_string nt_digit_0_to_9 "234";;- : char * string = ('2', "->[34]")# let nt_digit_0_to_9 =pack (const (fun ch -> '0' <= ch && ch <= '9'))
(fun ch -> (int_of_char ch) - ascii_0);;val nt_digit_0_to_9 :
char list -> int * char list = <fun># test_string nt_digit_0_to_9 "234";;- : int * string = (2, "->[34]")
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 138 / 175
Recursive productions
▶ Grammars are often recursive or mutually-recursive:▶ The non-terminal on the LHS of a production often appears on
the RHS (recursion)▶ The non-terminal on the LHS of a production often appears in
one of the RHSs of the transitive-reflexive closure of therelation (mutual recursion)
▶ Currently, we are unable to describe recursive rules using PCs
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 139 / 175
Recursive productions (continued)
We are unable to describe recursive rules using PCs:
⟨A⟩ → ( (⟨A⟩∗|ε) )
▶ The non-terminal A▶ The open-parenthesis token▶ The close-parenthesis token
▶ Nevermind that we don’t yet have star…▶ We can’t use nt_A before it’s defined!
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 140 / 175
Recursive productions (continued)
We are unable to describe recursive rules using PCs:let nt_A =caten (const (fun ch -> ch = '('))
(caten (disj (star nt_A)nt_epsilon)
(const (fun ch -> ch = ')')));;
▶ Nevermind that we don’t yet have star…▶ We can’t use nt_A before it’s defined!
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 141 / 175
Recursive productions (continued)We are unable to describe recursive rules using PCs:
▶ The problem is not specific to parsing combinators.▶ For example, you couldn’t define in Scheme:
(define f (g (h f)))because you can’t use something before it’s defined! (Ok, insome languages you can!)
▶ So how are recursive definitions possible at all?▶ When you define a recursive function you are not using the
function before it’s defined▶ You are using the address of the function before the function is
defined▶ Recursive functions are circular data structures:
▶ The language definition permits you to define these particularcircular structures statically, rather than at run-time
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 142 / 175
Parsing combinators (continued)
To implement recursive parsers, we need to delay the evaluation ofthe recursive non-terminal
▶ ”Wrap it in a lambda…”
let delayed thunk =fun s -> thunk() s;;
▶ A thunk is a procedure that takes zero arguments▶ Thunks are used to delay evaluation
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 143 / 175
Recursive productions (continued)
Example: Identifying digits (continued)# let nt_natural =let rec make_nt_natural () =pack (caten nt_digit_0_to_9
(disj (delayed make_nt_natural)nt_epsilon))
(function (a, s) -> a :: s) inmake_nt_natural();;
val nt_natural : char list -> int list * char list = <fun>
▶ Notice the packing function (function (a, s) -> a :: s)
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 144 / 175
Recursive productions (continued)
Example: Identifying digits (continued)# test_string nt_natural "1234";;- : int list * string = ([1; 2; 3; 4], "->[]")
We are not done yet:▶ We got a list of digits, as opposed to a list of chars!☞ We want to left-fold these digits into a number in base 10
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 145 / 175
Parsers combinators (continued)We pack the list of digits using a left-fold:# let nt_natural =let rec make_nt_natural () =pack (caten nt_digit_0_to_9
(disj (delayed make_nt_natural)nt_epsilon))
(function (a, s) -> a :: s) inpack (make_nt_natural())
(fun s ->(List.fold_left
(fun a b -> 10 * a + b)0s));;
val nt_natural : char list -> int * char list = <fun>▶ Notice the type of the parser: char list -> int * char list
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 146 / 175
Recursive productions (continued)
Testing it:
# test_string nt_natural "1234";;- : int * string = (1234, "->[]")
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 147 / 175
Recursive productions (continued)
The parser ntParen expresses the grammar of one set ofarbitrarily-nested parentheses:
# let rec ntParen s =pack (caten (const (fun ch -> ch = '('))
(caten (disj (delayed (fun _ -> ntParen))(pack nt_epsilon
(fun _ -> "ntParen")))(const (fun ch -> ch = ')'))))
(fun _ -> "ntParen") s ;;val ntParen :char list -> string * char list = <fun>
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 148 / 175
Recursive productions (continued)
Testing ntParen on various inputs:
# test_string ntParen "()";;- : string * string = ("ntParen", "->[]")# test_string ntParen "";;Exception: PC.X_no_match.# test_string ntParen "((()))";;- : string * string = ("ntParen", "->[]")# test_string ntParen "((())())";;Exception: PC.X_no_match.# test_string ntParen "((()))ABC";;- : string * string = ("ntParen", "->[ABC]")
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 149 / 175
Parsing combinators (continued)
▶ By now, our toolset of parsing combinators consists of▶ const▶ caten▶ disj▶ pack▶ delayed
▶ We can handle recursive grammars▶ We can create ASTs▶ In principle, we can implement parsers for any language☞ We now wish to add additional PCs to simplify the task of
writing parsers
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 150 / 175
New PC Operators (continued)
The Kleene StarThe Kleene-star is ameta-production-rule, or arule-schema, or a ”macro” overproduction-rules.
▶ For any NT P, P∗ stands for therule Pstar defined as follows:
Pstar → P Pstar | ε
▶ The point of the Kleene-star isto recognize the catenation ofzero or more expressions in P.
Stephen ColeKleene
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 151 / 175
New PC Operators (continued)
Here is our support for the Kleene-star:
let rec star nt =fun s ->try let (e, s) = (nt s) in
let (es, s) = (star nt s) in(e :: es, s)
with X_no_match -> ([], s);;
Notice how we match ε implicitly.
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 152 / 175
New PC Operators (continued)
The Kleene-plus▶ For any NT P, P+ stands for the rule Pplus defined as follows:
Pplus → P Pplus | P
▶ The point of the Kleene-plus is to recognize the catenation ofone or more expressions in P.
▶ Kleene didn’t really invent the Kleene-plus▶ Rather, Kleene-plus is a natural extension of Kleene-star
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 153 / 175
New PC Operators (continued)
Here is our support for the Kleene-plus:
let plus nt =pack (caten nt (star nt))
(fun (e, es) -> (e :: es));;
Notice how we define the Kleene-plus as the catenation ofKleene-star and the original NT.
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 154 / 175
New PC Operators (continued)Let’s test star and plus:# let star_star = star (const (fun ch -> ch = '*'));;val star_star : char list -> char list * char list = <fun># let star_plus = plus (const (fun ch -> ch = '*'));;val star_plus : char list -> char list * char list = <fun># test_string star_star "****the end!";;- : char list * string =
(['*'; '*'; '*'; '*'], "->[the end!]")# test_string star_plus "****the end!";;- : char list * string =(['*'; '*'; '*'; '*'], "->[the end!]")# test_string star_star "the end!";;- : char list * string = ([], "->[the end!]")# test_string star_plus "the end!";;Exception: PC.X_no_match.
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 155 / 175
New PC Operators (continued)
Ocaml provides the polymorphic typeα option = none | Some of α as a way of dealing with situationswhere a value may or may not exist.We’re going to use α option to implement maybe, which takes aparser r, and returns a parser r? that recognizes zero or oneoccurrences of whatever is recognized by r.
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 156 / 175
New PC Operators (continued)
let maybe nt =fun s ->try let (e, s) = (nt s) in
(Some(e), s)with X_no_match -> (None, s);;
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 157 / 175
New PC Operators (continued)
Assume you have the parser nt_integer, that recognizes integers.Here is how we might use maybe:
# test_string nt_integer "1234";;- : int * string = (1234, "->[]")# test_string (maybe nt_integer) "1234";;- : int option * string = (Some 1234, "->[]")# test_string (maybe nt_integer) "moshe";;- : int option * string = (None, "->[moshe]")
You would use pattern matching (via match) to handle both cases(None/Some)
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 158 / 175
New PC Operators (continued)
We might want to attach an arbitrary predicate to serve as a guardfor a parser, so that the parser succeeds only if the matched objectsatisfies the guard. This is what the guard PC does:
let guard nt pred =fun s ->let ((e, s) as result) = (nt s) inif (pred e) then resultelse raise X_no_match;;
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 159 / 175
New PC Operators (continued)
Let’s use guard to identify only even numbers:
# test_string (guard nt_integer (fun n -> n land 1 = 0))"12345";;
Exception: PC.X_no_match.# test_string (guard nt_integer (fun n -> n land 1 = 0))
"123456";;- : int * string = (123456, "->[]")
This exceeds the expressive power of CFGs!
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 160 / 175
Parsers Combinators (continued)
What next?🗸 Some simple parsers🗸 Learn about the algebra of PCs🗸 Learn of new PC operators▶ Learn how to use abstraction to make our life simpler
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 161 / 175
Functional abstraction in PCs
We now wish to demonstrate some examples of using functionalabstraction to write parsers in a general, consistent, and convenientway.Up to now we used to define single-character parsers using const:
let nt_A = const (fun ch -> ch = 'A');;
This is kind of clumsy. Let’s see how we can do this better!
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 162 / 175
Functional abstraction in PCs (continued)
let make_char equal ch1 = const (fun ch2 -> equal ch1 ch2);;let char = make_char (fun ch1 ch2 -> ch1 = ch2);;let char_ci =make_char (fun ch1 ch2 ->
(Char.lowercase_ascii ch1) =(Char.lowercase_ascii ch2));;
The use of make_char allows us to define parser-generatingfunctions for characters, in a case-sensitive or case-insensitive way.
☞ Warning: The version of ocaml installed in the labs usesChar.lowercase, which is now deprecated. It’ll be upgraded[next year].
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 163 / 175
Functional abstraction in PCs (continued)
# test_string (char 'a') "abc";;- : char * string = ('a', "->[bc]")# test_string (char 'a') "ABC";;Exception: PC.X_no_match.# test_string (char_ci 'a') "abc";;- : char * string = ('a', "->[bc]")# test_string (char_ci 'a') "ABC";;- : char * string = ('A', "->[BC]")
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 164 / 175
Functional abstraction in PCs (continued)
If we wish to recognize entire words, this is still very cumbersome.We can put to a good use the algebra of catenation to do better: Toidentify a word, we —
▶ Take a string of chars, and convert it to a list▶ Map over each character in the list, creating a parser that
recognizes that character▶ Perofrm a right fold over that list using the caten operation
(with an approriate pack)▶ The unit element is the unit element of catenation, namely
epsilonBy abstracing over char we can get both case-sensitive andcase-insensitive variants!
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 165 / 175
Functional abstraction in PCs (continued)
Here is the code:
let make_word char str =List.fold_right(fun nt1 nt2 -> pack (caten nt1 nt2)
(fun (a, b) -> a :: b))(List.map char (string_to_list str))nt_epsilon;;
let word = make_word char;;let word_ci = make_word char_ci;;
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 166 / 175
Functional abstraction in PCs (continued)
# test_string (word "moshe")"moshe is a nice guy!";;
- : char list * string =(['m'; 'o'; 's'; 'h'; 'e'], "->[ is a nice guy!]")
# test_string (word_ci "moshe")"Moshe is a nice guy!";;
- : char list * string =(['M'; 'o'; 's'; 'h'; 'e'], "->[ is a nice guy!]")
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 167 / 175
Functional abstraction in PCs (continued)
We might want to pick any single character in a string. Rather thanspecifying long disjunctions, we can use one_of to do this for us.
▶ Very similar to word:▶ We use disj rather than caten▶ The unit element for disj is nt_none
Such is the power of abstraction!
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 168 / 175
Functional abstraction in PCs (continued)
let make_one_of char str =List.fold_rightdisj(List.map char (string_to_list str))nt_none;;
let one_of = make_one_of char;;let one_of_ci = make_one_of char_ci;;
As usual, we generate both the case-sensitive and case-insensitiveversions!
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 169 / 175
Functional abstraction in PCs (continued)
Let’s try out one_of:
# test_string (one_of "abcdef") "moshe!";;Exception: PC.X_no_match.# test_string (one_of "abcdef") "be moshe!";;- : char * string = ('b', "->[e moshe!]")
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 170 / 175
Functional abstraction in PCs (continued)
When we wanted to recognize a range of characters, we, once again,used the const PC. We can do better using abstraction:
let make_range leq ch1 ch2 (s : char list) =const (fun ch -> (leq ch1 ch) && (leq ch ch2)) s;;
let range = make_range (fun ch1 ch2 -> ch1 <= ch2);;let range_ci =make_range (fun ch1 ch2 ->
(Char.lowercase_ascii ch1) <=(Char.lowercase_ascii ch2));;
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 171 / 175
Functional abstraction in PCs (continued)
And here is how we can test range:
# test_string (star (range 'a' 'z')) "hello world!";;- : char list * string =
(['h'; 'e'; 'l'; 'l'; 'o'], "->[ world!]")# test_string (star (range 'a' 'z')) "HELLO WORLD!";;- : char list * string =
([], "->[HELLO WORLD!]")# test_string (star (range_ci 'a' 'z')) "Hello World!";;- : char list * string =
(['H'; 'e'; 'l'; 'l'; 'o'], "->[ World!]")
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 172 / 175
Functional abstraction in PCs (continued)How might you debug parsers written using PCs?
▶ The PC trace_pc is a wrapper (using the decorator pattern)that can be used to trace any parser
▶ The trace_pc PC takes a documentation string and a parser,and returns a tracing parser.
# test_string (trace_pc "The word \"hi\"" (word "hi"))"high";;
;;; The word "hi" matched the head of "high",and the remaining string is "gh"
- : char list * string = (['h'; 'i'], "->[gh]")# test_string (trace_pc "The word \"hi\"" (word "hi"))
"bye";;;;; The word "hi" failed on "bye"Exception: PC.X_no_match.
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 173 / 175
Parsers Combinators (continued)
What next?🗸 Some simple parsers🗸 Learn about the algebra of PCs🗸 Learn of new PC operators🗸 Learn how to use abstraction to make our life simpler
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 174 / 175
Further reading
🔗 Parsing Combinators
Mayer Goldberg \ Ben-Gurion University Compiler Construction October 31, 2018 175 / 175