practical use of automata and formal languages in the ... · creating three assignments covering...


BACHELOR INFORMATICA

Practical use of Automata and Formal Languages in the compiler field

Daan de Graaf

June 9, 2017

Supervisor(s): Inge Bethke (UVA)

Signed: Inge Bethke (UVA)

Informatica, Universiteit van Amsterdam


Abstract

The principles in the study of Automata and Formal Languages (AFL) are mostly treated in a theoretical manner, without mentioning the applicability of the theory. To express the practical use, this paper discusses three assignments covering the major subjects in AFL theory. Using a compiler design approach, a source program consisting of simple arithmetic expressions is compiled and eventually executed. The execution yields the solution of the expression. The student completes the provided framework, using knowledge acquired from AFL theory, by implementing certain phenomena found in AFL as well as in compiler design theory. The assignments let the student put theory into practice and at the same time give an insight into compiler design.


Contents

1 Introduction
  1.1 Context
  1.2 Goal
  1.3 Educational Objectives

2 Background
  2.1 Automata and Formal Languages
  2.2 Related Work

3 Choices
  3.1 Python
  3.2 Framework
  3.3 The Compiler
  3.4 Grading

4 Design
  4.1 Lexing
    4.1.1 Objectives
  4.2 Parsing
    4.2.1 Objectives
  4.3 Intermediate phases
    4.3.1 Type checking
    4.3.2 Intermediate code generation
    4.3.3 Register allocation
    4.3.4 Machine code generation
    4.3.5 Assembly and linking
  4.4 Executing
    4.4.1 The instruction set
    4.4.2 Objectives

5 Implementation
  5.1 Assignment 1
    5.1.1 Lexer
    5.1.2 Parser
  5.2 Assignment 2
  5.3 Assignment 3

6 Conclusion
  6.1 Reflection

Bibliography


CHAPTER 1

Introduction

1.1 Context

The principles in the study of Automata and Formal Languages, in this paper referred to as AFL, are mainly treated in a theoretical manner. However, the phenomena found in this study are spread broadly across computer science and are applicable in many disciplines. The theory is used in compilers, text processing, natural languages and genomes [7].

The existing literature and books only provide exercises that capture the idea of certain principles in AFL, neglecting its applications or mentioning them only briefly [5]. The idea of this research is to design assignments for AFL courses that reflect the applicability of such Automata and Formal Languages.

Since compiler design is itself part of computer science, and thus has more value to computer science students, this discipline is chosen as the common thread of this research. After investigating compiler design studies, it became clear that, due to the complexity of compiler design, the created assignments should be comprehensible and should abstract away some of the specific compiler elements, to keep things simple for students.

This research focuses on compiling a source program, consisting of simple arithmetic expressions, to a final result, namely the answer to the expression, involving AFL and compiler phenomena. In order to achieve this, the goal and objectives are discussed first, to define what the assignments should contain (and what not). Then the issue with current assignments and exercises is addressed in more detail and some solutions from other literature are discussed. Finally, the design of the assignments is presented.

1.2 Goal

The goal of this research is to give students a basic understanding of the practical uses of AFL by creating practical assignments.

To this end, three assignments are created covering Finite Automata, Push-Down Automata and Turing Machines, which test acquired knowledge with simple implementations for the specific subjects. The assignments should provide a framework¹ with an easy-to-use programming language and structure, to facilitate implementing AFL aspects and to abstract away preparatory work.

The student will eventually have a general idea of particular applications and may develop some interest in the compiler field. At the same time the assignments should not overwhelm the student with too much new information.

¹ Pre-written code offering a basic setup taking care of I/O, printing, objects and corresponding error handling.


1.3 Educational Objectives

To educate and give an understanding of the use of AFL, the focus lies on simplicity. Most students taking the course are in their first year of computer science and are still learning to program. An excess of information on compiler theory would draw attention away from the main subject. Reducing the information on compiling should prevent this. However, this comes with some drawbacks, as will be discussed later on.

Since the main audience is just starting to write code, students are encouraged to write as much code as possible themselves. The final framework should keep the balance between developing programming skills and keeping the workload feasible.

Encouraging students to understand the different concepts in AFL theory is of equal importance. The investigation of integrating these two aspects can be found further on.


CHAPTER 2

Background

2.1 Automata and Formal Languages

As mentioned before, three main concepts are found in the AFL study: Finite Automata (FA), Push-Down Automata (PDA) and Turing Machines (TM). The explanations in the literature generally use a simple alphabet Σ = {a, b} or Σ = {0, 1} to express words like “a”, “bba”, “babaaba” or “0”, “110”, “1011000” [5, 1], and exercises often ask for automata accepting a certain set of strings, for example strings with an even number of 1’s and 0’s, or strings containing the substring aba. This is known as pattern matching and is an important application of finite automata. These basic concepts show the capabilities of automata, but not their use.

The PDA is introduced as an extension of the finite automaton: accompanied by a memory, it is capable of handling non-regular languages like the set {a^n b^n : n ≥ 0}.

After the Turing machine is introduced, the notion of computability follows. However, a TM is capable of simulating the logic of every existing computer, and exploiting this is left out in most literature. As noted in [8], instead of looking at the state in which a TM halts, one can look at the final contents of the tape. From this perspective a TM implements a function with input and output.
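The parity exercise mentioned above is small enough to write out in full. The following is a minimal sketch and is not taken from the cited literature: the state names, the `TRANSITIONS` table and the `accepts` function are illustrative choices, and the language is read here as "an even number of 0's and an even number of 1's".

```python
# Each state records the parity of the 0's and 1's read so far:
# first letter = parity of 0's, second letter = parity of 1's (e = even, o = odd).
TRANSITIONS = {
    ("ee", "0"): "oe", ("ee", "1"): "eo",
    ("oe", "0"): "ee", ("oe", "1"): "oo",
    ("eo", "0"): "oo", ("eo", "1"): "ee",
    ("oo", "0"): "eo", ("oo", "1"): "oe",
}
START = "ee"
ACCEPTING = {"ee"}


def accepts(word):
    """Run the DFA over `word` and report whether it ends in an accepting state."""
    state = START
    for symbol in word:
        key = (state, symbol)
        if key not in TRANSITIONS:
            # The symbol is not part of the alphabet {0, 1}.
            return False
        state = TRANSITIONS[key]
    return state in ACCEPTING


print(accepts("0110"))     # two 0's and two 1's
print(accepts("1011000"))  # four 0's and three 1's
```

The automaton needs only four states because it remembers two parities and nothing else. This is exactly the limitation that the PDA lifts by adding a memory.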

2.2 Related Work

In the AFL field as well as in the compiler field, formal languages are coupled with compiler design. The book Formal Language: A practical introduction [8] uses simple arithmetic expressions as an example of the use of AFL in relation to compiling. Readers complete a provided framework written in Java by implementing functions that parse expressions given as user input. However, the given examples work for one specific grammar only, which detracts from the power of specifying arbitrarily many grammars without changing the code. In other words, when the grammar is adjusted, the source code has to change as well, instead of only the language-specific grammar. The ideas of AFL, context-free grammars to be more specific, are used as a route to writing the grammar in hard-coded form and not as a grammar in itself.

The book also pays attention to the TM by providing exercises on TM functions. Readers are asked to create TM functions for multiplying, subtracting and dividing binary as well as unary numbers. This puts the focus on the final contents of the tape, more than on the state in which the machine halts. From that point of view the TM is used as a calculator or, more generally, an executor of certain instructions.

Furthermore, Basics of Compiler Design [6] provides a thorough introduction to the related AFL aspects for each phase of the compiling process. It also examines the construction of a parse table, which is later used by the parser, using simple arithmetic expressions. Removing ambiguity, constructing a usable grammar and maintaining operator precedence together provide the knowledge needed to construct a simple compiler.


In [4], the use of Python as a pseudo language for formal language theory is discussed. Using selected topics, it is shown that Python is suitable for the structures and algorithms of formal language theory. The selected topics are not the ones commonly found in theory. Specifying a grammar for Roman numerals and performing a syntax analysis on an input is a perfect example of the application of formal languages.

There is also a section about translation, using a grammar to translate from Roman to Arabic numerals. However, this method is not used in this research, since other implementations were encountered and the approach may not be accommodating enough.

The exact same procedure of compiling using AFL aspects is followed in [3]. It takes a user-specified grammar as input and has several functions that can be applied to the grammar. The same principles are found in this research. Although the GitHub project is helpful, it provides no challenges to the students and its structure is too complicated for use in this project. However, some thoughts on the implementation of certain data structures came from this project.


CHAPTER 3

Choices

Before the design and implementation part, some choices have been made in order to narrow the scope of the project and make the assignments workable. This means that some core aspects of compiler design are neglected or provided in an understandable manner in order to keep the focus on AFL.

3.1 Python

Python is an easy language to start with for a first-year computer science student. Constraining students to another language has its drawbacks and does not benefit the outcome of the assignments. As mentioned in [4], Python can significantly improve the study of formal language theory and its applications. Taking that into account, no other languages were considered for this project.

3.2 Framework

For each assignment a framework is provided. The framework serves as a starting point for the students and helps future correctors with reviewing assignments. The framework contains the following elements:

Error handling: If an error (of any sort) occurs and the cause of this error is known, the program exits and returns a corresponding message about the problem.

I/O management: Reading the source program from memory and writing the result back is taken care of by the framework. In addition, the user input is checked for an argument specifying the source file. The corresponding error messages cover a non-existent file or directory, no file path specified, or an unreadable source.

Functions: To encourage the use of certain conventions, some basic functions are provided. The functions help with implementing the assignments and come with comments on how to use them.

Objects: The framework of each assignment contains an object representing an AFL principle. The principles FA, PDA and TM are used in assignments 1, 2 and 3 respectively. The objects create a structure for automata and grammars, specified by the student or already present in the framework. This means that a grammar, provided in a certain data structure, can be read and stored in the object. Methods can then be applied to the object.


3.3 The Compiler

As briefly explained above, compiler design is part of computer science and has some interesting principles that can be found in AFL. There is also enough literature combining these two theories to encourage a closer look. The choices made during the implementation of the compiler design stem from three origins.

The first is the simplification of certain processes and phases. The principles in compiler design are abstracted for students in order to focus on the AFL implementation. This is done by providing elements in the framework or by skipping some parts.

The second origin is the addition of elements. Some elements are added to the assignments to keep them challenging or more consistent. The third origin is complications during the implementation.

All details are described in further sections.

3.4 Grading

Since the assignments are for educational purposes, the solutions to the assignments are completed and provided in a separate framework. This framework can be used by teachers, assistants or correctors to check the students' assignments and grade them. The grading scheme is not discussed in this paper; only brief comments about the challenging aspects are provided. The reason for this is that the teacher may have a different view on which matter is more important, and students may find that the grading does not suit the assignment (since subjects may be covered less or not at all).


CHAPTER 4

Design

A basic compiler consists of seven compiling phases [6]:

Lexical analysis: In this phase characters from the source code are grouped into tokens. Tokens can vary from an integer to a plus sign to a white space (see Section 4.1).

Syntax analysis: The generated token list is parsed and has to meet a specific structure. Whenever the code is not accepted, a syntax error with a corresponding error message is thrown (see Section 4.2).

Type checking: In this phase the consistency requirements of the parsed code are checked, for example whether a variable is used but not declared, or whether an integer is assigned to a string variable (see Section 4.3.1). This phase is skipped in this project.

Intermediate code generation: In this phase the code is translated to an intermediate language (see Section 4.3.2).

Register allocation: The variable names are translated to numbers which correspond to a register (see Section 4.3.3).

Machine code generation: The intermediate code is translated to a machine-specific assembly language (see Section 4.3.4).

Assembly and linking: The last phase translates the assembly language into a binary representation, and the addresses of functions, variables, etc., are determined (see Section 4.3.5).

In order to keep it simple, the focus lies on the lexical and syntax analysis and on the execution of the generated machine code (note that this is not a compile phase). To obtain a working compiler without getting too detailed, some adjustments regarding the phases have to be made. By extracting the phases that make extensive use of AFL principles and setting aside the other phases for a moment, the scheme in Figure 4.1 remains. This is the core part of the assignments and the main implementation for students. In the next sections, each phase is clarified, followed by an explanation of how the omitted phases take effect in each phase.


Figure 4.1: Compiling scheme. Dashed arrows include intermediate phases, found in Section 4.3.

The source code consists of one or multiple simple arithmetic expressions¹, the expression language, which will be compiled and executed. The grammar in Figure 4.2 defines valid expressions.

A → id = E
E → E O E | (E) | int | id
O → + | − | ∗ | /

Figure 4.2: Grammar representation of valid expressions. Note that this grammar will be modified further on and serves as a clarification.

Lowercase characters, including the operators and brackets, represent terminals, to distinguish them from non-terminals².

a = 2 + (3 ∗ 2) and b = a/4 are examples of expressions that can be used. An id is a character which can be followed by other characters or digits; a, result1 and A2b4g are all valid ids. An int is a digit which can be followed by other digits, e.g. 1, 528 or 10000.

Only the natural numbers ℕ can be properly used for computation, as becomes clear later on when a unary representation of numbers is used. Moreover, every intermediate result of an expression has to be a natural number too in order for the program to work properly. So although a = 50 + (10 − 23) results in a natural number (37), the intermediate result of (10 − 23) equals −13 and is not a natural number.

One can clearly see from the grammar that the use of the equal sign is obligatory. This means that every expression is an initialisation of a variable, which is useful later on when the code is executed (see Section 4.4).

4.1 Lexing

In the lexical analysis part of the compilation, an input stream of characters from the source code is grouped into tokens. Each token is assigned an identifier in the form of an uppercase string. A deterministic finite automaton (DFA), as illustrated in Figure 4.4, is used as the tokenizer. It can be seen that an ID should always start with a character, possibly followed by characters or digits. A digit is any of the values in [0-9] and a character is any of the values in [a-zA-Z]³. If there is no transition possible from the START state to an accepting state, the character is not part of the alphabet and thus not recognised.

Tokenizing the input source Var = 34 + (5*56) results in the stream shown in Figure 4.3.

¹ The source code file can have multiple lines to use previously declared variables.
² Symbols that can be replaced by other symbols are called non-terminals and symbols that cannot be replaced by other symbols are called terminals [1].
³ This is done to omit a transition for every digit and character, which would result in 62 additional state-transitions.

Figure 4.3: Tokens with corresponding identifiers.

In order to work properly, the longest stream that matches a token is tokenized. As an example, var22 is tokenized as var22 instead of var and 22. Note that “var 22”, with a space, is syntactically incorrect.

A white-space will always separate two tokens, since there is no white-space transition from an accepting state to an accepting state. The accepting WHITE state could also have had a white-space transition to itself, since consecutive white-spaces still make a white-space. However, considering the arithmetic expression format of the input, consecutive white-spaces are uncommon.
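The longest-match rule can be sketched with a transition table that mirrors the description of the DFA. This is a minimal illustration under stated assumptions, not the framework's code: the function names `classify` and `tokenize` are hypothetical, the bracket identifiers `LPAREN` and `RPAREN` are invented here because the text does not name them, and the remaining identifiers follow the ones used in this paper (ID, INT, ADD, SUB, MUL, DIV, EQU, WHITE).

```python
import string

# DFA transitions: (state, input class) -> next state.
TRANSITIONS = {
    ("START", "char"): "ID", ("ID", "char"): "ID", ("ID", "digit"): "ID",
    ("START", "digit"): "INT", ("INT", "digit"): "INT",
    ("START", "+"): "ADD", ("START", "-"): "SUB",
    ("START", "*"): "MUL", ("START", "/"): "DIV",
    ("START", "="): "EQU", ("START", " "): "WHITE",
    ("START", "("): "LPAREN", ("START", ")"): "RPAREN",  # assumed names
}
# Every state except START is accepting in this sketch.
ACCEPTING = {target for target in TRANSITIONS.values()}


def classify(ch):
    """Collapse all digits and letters into one class each (the 62-transition shortcut)."""
    if ch in string.digits:
        return "digit"
    if ch in string.ascii_letters:
        return "char"
    return ch


def tokenize(line):
    """Split `line` into (identifier, text) pairs using the longest match."""
    tokens = []
    pos = 0
    while pos < len(line):
        state, end = "START", pos
        # Keep following transitions for as long as one exists.
        while end < len(line) and (state, classify(line[end])) in TRANSITIONS:
            state = TRANSITIONS[(state, classify(line[end]))]
            end += 1
        if state not in ACCEPTING:
            raise ValueError(f"unrecognised character {line[pos]!r} at position {pos}")
        tokens.append((state, line[pos:end]))
        pos = end
    return tokens


print(tokenize("Var = 34 + (5*56)"))
```

Because every non-START state is accepting here, following transitions until none is left already yields the longest match. A DFA with non-accepting intermediate states would additionally need to remember the last accepting position. Whether WHITE tokens are kept or dropped before parsing is left open in this sketch.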


Figure 4.4: A DFA for tokenizing the characters from the input stream.

4.1.1 Objectives

The DFA is fairly straightforward due to the simple tokens of an arithmetic expression. However, the basis of using an FA for pattern matching and of using regular expressions is there. The resulting implementation, disregarding the character and digit simplification, should work on any given DFA for any input stream.

The implementation should include checking each input character for the presence of a valid state transition to an accepting state. Checking multiple lines from the source and eventually creating a decent structure for use in the next phase are also part of the implementation.


4.2 Parsing

Syntax analysis recombines the created token list into a syntax tree which reflects the structure of the source [6]. This process is referred to as parsing. The parser checks whether the syntax of the source program is valid and throws a syntax error for invalid syntax.

Initially a DFA can be used to parse the tokens and check the syntax, since checking syntax is still basic pattern matching. Figure 4.5 shows a DFA which matches the syntax of the arithmetic expressions. For every input token a check for a transition from the current state is made; if the automaton eventually ends in an accepting state, the syntax is correct.

Nevertheless, this method does not work correctly. As mentioned in [1, 5, 6], the set {a^n b^n : n ≥ 0} is not a regular language and therefore cannot be accepted by a DFA. The problem lies in the fact that after reading a certain number of a’s, there is no memory to store that number, and therefore the DFA does not know how many b’s it should read. The same problem is found in the arithmetic expressions. Whenever an opening bracket occurs, a closing bracket should follow (possibly with any number of bracket pairs nested between those two). This principle is called balanced parentheses. The DFA in Figure 4.5 can only be used properly if there are no brackets in the source code or if the parentheses in the source code happen to be balanced.

Another reason the DFA cannot be used for proper parsing is the lack of operator precedence. The automaton only accepts strings with a certain pattern and does not take precedence into consideration.
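The role of memory in both examples can be made concrete with a stack. The two functions below are a minimal sketch written for illustration only; the names `accepts_anbn` and `balanced` are hypothetical and do not come from the framework. Each one does what a finite automaton cannot: it counts, by pushing and popping.

```python
def accepts_anbn(word):
    """Accept exactly the words a^n b^n (n >= 0), using a stack as memory."""
    stack = []
    seen_b = False  # two control states: still reading a's, or already reading b's
    for symbol in word:
        if symbol == "a" and not seen_b:
            stack.append("a")      # remember one more a
        elif symbol == "b" and stack:
            seen_b = True
            stack.pop()            # cancel one remembered a
        else:
            return False           # wrong symbol, wrong order, or too many b's
    return not stack               # every a must have been cancelled


def balanced(symbols):
    """Check that every "(" is closed by a later ")" and vice versa."""
    stack = []
    for symbol in symbols:
        if symbol == "(":
            stack.append(symbol)
        elif symbol == ")":
            if not stack:
                return False       # closing bracket without an opening one
            stack.pop()
    return not stack               # no opening bracket may be left over


print(accepts_anbn("aabb"), accepts_anbn("aab"))
print(balanced("34 + (5 * (56))"), balanced("34 + (5 * 56"))
```

Both functions share the same structure, which is the point of the comparison in the text: the brackets of an arithmetic expression behave like the a's and b's of the textbook language.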

Figure 4.5: A DFA for parsing the tokens.

These obstacles can be avoided using a context-free grammar (CFG). The grammar in Figure 4.6 is an LL(1)⁴ grammar which can be used for recursive descent parsing. The grammar is unambiguous and left-recursion is eliminated. The construction of such a grammar can be found in [6]. In this implementation the use of a syntax tree is skipped, since the grammar used is left-factored and does not need a syntax tree. Instead, some semantic actions are embedded inside the CFG (see Section 4.3.2), which replace the need for a syntax tree.

⁴ The first L indicates the reading direction (left-to-right), the second L indicates the derivation order (leftmost) and the 1 indicates that there is a one-symbol look-ahead [6].


S  → A eof        (4.1)
A  → id = E       (4.2)
E  → T E′         (4.3)
E′ → + T E′       (4.4)
E′ → − T E′       (4.5)
E′ → ε            (4.6)
T  → F T′         (4.7)
T′ → ∗ F T′       (4.8)
T′ → / F T′       (4.9)
T′ → ε            (4.10)
F  → (E)          (4.11)
F  → int          (4.12)
F  → id           (4.13)

Figure 4.6: Unambiguous LL(1) grammar. Bold tokens indicate terminal symbols.

The automata used for top-down parsing are push-down automata, characterised by a stack that functions as the memory discussed earlier. A PDA can push a symbol on top of the stack or pop the symbol on top of the stack. The items that get pushed or popped are determined by the input symbol, the current state of the automaton and the symbol on top of the stack, from now on referred to as the stack-symbol. With this information, the automaton can follow a transition with a corresponding production chosen from a parse table.

Since the CFG is of a recursive nature, meaning that the same non-terminal symbol can appear on the left as well as on the right side of the arrow (E′ → + T E′), determining how the expression is produced from the grammar can be challenging. So to determine which production to apply, a look-ahead symbol is used. This symbol is the upcoming symbol from the input stream and decides which production to apply. On line 10 in Table 4.1, the look-ahead symbol is + and the stack-symbol is E′. Looking up the + symbol and the non-terminal stack-symbol E′ in the parse table in Table 4.2 gives production 4. This corresponds to production 4.4 from the grammar in Figure 4.6. So in the next step (11) in Table 4.1, the stack looks like this: $ eof E′ T +. Whenever the top of the stack and the first symbol of the input are the same, they are matched and both removed. Eventually the stack holds only the $ marker, meaning the syntax of the input is correct.

Table 4.1: First twelve steps of top down parsing the sample expression Var = 34 + (5 ∗ 56).

     STACK               INPUT                  OUTPUT
 1   $ S                 Var = 34 + (5 ∗ 56)
 2   $ eof A             Var = 34 + (5 ∗ 56)    S → A eof
 3   $ eof E = id        Var = 34 + (5 ∗ 56)    A → id = E
 4   $ eof E =           = 34 + (5 ∗ 56)        match
 5   $ eof E             34 + (5 ∗ 56)          match
 6   $ eof E′ T          34 + (5 ∗ 56)          E → T E′
 7   $ eof E′ T′ F       34 + (5 ∗ 56)          T → F T′
 8   $ eof E′ T′ int     34 + (5 ∗ 56)          F → int
 9   $ eof E′ T′         + (5 ∗ 56)             match
10   $ eof E′            + (5 ∗ 56)             T′ → ε
11   $ eof E′ T +        + (5 ∗ 56)             E′ → + T E′
12   $ eof E′ T          (5 ∗ 56)               match


Table 4.2: Parse table.

      (    )    +    -    *    /    int  id   eof
S                                        1
A                                        2
E     3                             3    3
E′         6    4    5                        6
T     7                             7    7
T′         10   10   10   8    9              10
F     11                            12   13
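The parse table and the stack discipline described above translate fairly directly into code. The following is a minimal sketch, not the framework's actual implementation: the names `PRODUCTIONS`, `PARSE_TABLE`, `ParseError` and `parse` are hypothetical, and the tokens are assumed to arrive as plain terminal names such as "id" or "int" with white-space already removed. The production numbers 1 to 13 correspond to (4.1) to (4.13).

```python
class ParseError(Exception):
    """Raised when the token stream does not match the grammar."""


# Right-hand sides of the productions; an empty list stands for epsilon.
PRODUCTIONS = {
    1: ["A", "eof"], 2: ["id", "=", "E"], 3: ["T", "E'"],
    4: ["+", "T", "E'"], 5: ["-", "T", "E'"], 6: [],
    7: ["F", "T'"], 8: ["*", "F", "T'"], 9: ["/", "F", "T'"], 10: [],
    11: ["(", "E", ")"], 12: ["int"], 13: ["id"],
}

# PARSE_TABLE[non-terminal][look-ahead] -> production number.
PARSE_TABLE = {
    "S": {"id": 1},
    "A": {"id": 2},
    "E": {"(": 3, "int": 3, "id": 3},
    "E'": {")": 6, "+": 4, "-": 5, "eof": 6},
    "T": {"(": 7, "int": 7, "id": 7},
    "T'": {")": 10, "+": 10, "-": 10, "*": 8, "/": 9, "eof": 10},
    "F": {"(": 11, "int": 12, "id": 13},
}


def parse(tokens):
    """Return the production numbers applied, or raise ParseError."""
    tokens = list(tokens) + ["eof"]
    stack = ["$", "S"]
    applied = []
    pos = 0
    while stack[-1] != "$":
        top = stack.pop()
        lookahead = tokens[pos]
        if top in PARSE_TABLE:
            number = PARSE_TABLE[top].get(lookahead)
            if number is None:
                raise ParseError(f"unexpected {lookahead!r} while expanding {top}")
            applied.append(number)
            # Push the right-hand side reversed, so its first symbol ends on top.
            stack.extend(reversed(PRODUCTIONS[number]))
        elif top == lookahead:
            pos += 1  # match: drop the symbol from both stack and input
        else:
            raise ParseError(f"expected {top!r} but found {lookahead!r}")
    if pos != len(tokens):
        raise ParseError("input continues after the end of the expression")
    return applied


# Var = 34 + (5 * 56) as terminal names
print(parse(["id", "=", "int", "+", "(", "int", "*", "int", ")"]))
```

For the sample expression the first productions reported are 1, 2, 3, 7, 12, 10 and 4, which are the ones listed in the OUTPUT column of Table 4.1. Because the grammar lives entirely in the two dictionaries, a different LL(1) grammar can be parsed by replacing the data and leaving `parse` unchanged.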

4.2.1 Objectives

As explained, non-regular languages cannot be accepted by finite automata, and the solution is found in CFGs and PDAs. The theoretical example of a number of a’s followed by the same number of b’s now shows up in a real-world problem (balanced parentheses). The student comes across this problem when first implementing the syntax analysis using the DFA approach, and then experiences the need for more powerful automata. Since the walk-through of a DFA is done in the lexing part, the approach serves as an introduction to the problem and not per se as a challenge. However, the approach is appropriate for creating informative error messages, since each state reveals which symbols are expected. The goal then is to implement a PDA that makes use of the CFG and the parse table shown above. Since some elements are provided in the framework, getting familiar with the code and the data structures used can be another challenge, since there are multiple approaches, as seen in Section 2.2.

4.3 Intermediate phases

At this point, compiling phases one and two have been discussed, namely lexical analysis and syntax analysis respectively. The remaining phases are discussed in this section.

4.3.1 Type checking

The type checking phase is left out of this concept of a compiler. The reason for this is that the checks that would be needed are manageable and can be dealt with in other ways. Obvious checks to perform are: checking whether the (intermediate) result is a natural number, checking whether the left-hand side of a subtraction is greater than or equal to the right-hand side, and checking whether the left-hand side of a division is a multiple of the right-hand side. All of this is left out and dealt with at run-time. Evaluating these conditions beforehand would in a sense amount to solving the expression already and is therefore not desirable.
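The run-time checks listed above can be sketched as two guarded operations. This is an illustration using ordinary Python integers, not the unary machine described later, and the names `natural_sub` and `natural_div` are hypothetical. The division-by-zero check is an added assumption, since the text does not discuss that case.

```python
def natural_sub(left, right):
    """Subtract within the natural numbers; the left side must not be smaller."""
    if left < right:
        raise ArithmeticError(f"{left} - {right} is not a natural number")
    return left - right


def natural_div(left, right):
    """Divide within the natural numbers; the left side must be a multiple of the right."""
    if right == 0:
        # Assumption: not covered by the text, rejected here for safety.
        raise ArithmeticError("division by zero")
    if left % right != 0:
        raise ArithmeticError(f"{left} is not a multiple of {right}")
    return left // right


# a = 50 + (10 - 23) fails on the intermediate result, although 37 is natural.
try:
    print(50 + natural_sub(10, 23))
except ArithmeticError as error:
    print("run-time error:", error)
```

The example shows why the check cannot be decided from the final value alone: the failure sits in the intermediate subtraction, which only becomes visible while the expression is being evaluated.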

4.3.2 Intermediate code generation

For the execution, a distinction between the elements is made on a higher level. Where the + has an ADD identifier in the tokenized code, the ADD operation in turn has an operator identifier in the intermediate code. Table 4.3 gives an overview of the intermediate code identifiers. The distinction is needed to simplify the eventual execution. Since values are converted to unary (see Section 4.3.5), variables are read from memory, operators operate on values and assignments write to memory, a clear distinction between them is needed.


Table 4.3: Intermediate identifiers with corresponding token identifiers.

Intermediate identifier    Token identifier
variable                   ID
value                      INT
operator                   ADD, SUB, MUL, DIV
assignment                 EQU
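Table 4.3 can be expressed as a simple lookup. The sketch below is one possible form; the names `INTERMEDIATE_TYPE` and `intermediate_type` are hypothetical and not taken from the framework.

```python
# Mapping from token identifier to intermediate identifier, as in Table 4.3.
INTERMEDIATE_TYPE = {
    "ID": "variable",
    "INT": "value",
    "ADD": "operator",
    "SUB": "operator",
    "MUL": "operator",
    "DIV": "operator",
    "EQU": "assignment",
}


def intermediate_type(token_identifier):
    """Return the intermediate identifier, or None for tokens without one."""
    return INTERMEDIATE_TYPE.get(token_identifier)
```

Token identifiers that do not appear in Table 4.3, such as the brackets, simply yield `None` in this sketch.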

4.3.3 Register allocation

The register allocation of variables is also done at run-time. Whenever the machine comes across a variable during execution, it checks its simple memory, implemented as a Python dictionary. If the variable occurs in memory, its value is read and written to the tape. If the variable does not occur in memory, it is declared and set to a pending state until the solution of the expression is found. The result is then assigned to the pending variable.
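The lookup and the pending state can be sketched with a dictionary as follows. The class name `Memory` and its methods are hypothetical, and writing the value to the tape is left out; only the bookkeeping described above is shown.

```python
class Memory(object):
    """Dictionary-backed variable memory with a single pending variable."""

    def __init__(self):
        self.cells = {}      # variable name -> stored value
        self.pending = None  # name waiting for the result of the expression

    def read(self, name):
        """Return the stored value, or mark the name as pending and return None."""
        if name in self.cells:
            return self.cells[name]
        self.pending = name
        return None

    def assign(self, value):
        """Store the result of the expression in the pending variable."""
        if self.pending is None:
            raise RuntimeError("no variable is waiting for an assignment")
        self.cells[self.pending] = value
        self.pending = None
```

For example, reading `Var` from an empty memory marks it as pending, and a later `assign(2)` makes `read("Var")` return 2.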

4.3.4 Machine code generation

The parsing phase only completes if the syntax is valid; otherwise it returns an error to the user. Before the eventual execution, some preparation is needed. The machine code is defined in postfix notation⁵, which results from semantic actions embedded in the CFG. The actions change the infix notation in the grammar into postfix notation, as can be seen on slide 53 of [9]. Figure 4.7 shows only those lines of the CFG from Figure 4.6 in which actions are embedded.

A  → id = E [EQU]        (4.14)
E′ → + T [ADD] E′        (4.15)
E′ → − T [SUB] E′        (4.16)
T′ → ∗ F [MUL] T′        (4.17)
T′ → / F [DIV] T′        (4.18)

Figure 4.7: The LL(1) grammar with embedded semantic actions.

During the parsing phase these actions are also pushed onto the stack. Once an action, a variable or a value is the top stack symbol, it is appended to the postfix array. The postfix array eventually holds the machine code for further use. The parse routine for the expression Var = 1 + 1, using the grammar with embedded actions, can be found in Table 4.4.

⁵ A notation for writing arithmetic expressions in which the operands appear before their operators. Source: http://www.cs.csi.cuny.edu/~zelikovi/csc326/data/assignment5.htm.


Table 4.4: Top-down parsing of the sample expression Var = 1 + 1 with embedded semantic actions.

STACK                          INPUT             OUTPUT                 POSTFIX
$ S                            Var = 1 + 1 \n
$ eof A                        Var = 1 + 1 \n    S → A eof
$ eof [EQU] E = id             Var = 1 + 1 \n    A → id = E [EQU]
$ eof [EQU] E =                = 1 + 1 \n        match
$ eof [EQU] E                  1 + 1 \n          match
$ eof [EQU] E′ T               1 + 1 \n          E → T E′
$ eof [EQU] E′ T′ F            1 + 1 \n          T → F T′
$ eof [EQU] E′ T′ int          1 + 1 \n          F → int
$ eof [EQU] E′ T′              + 1 \n            match and add          1
$ eof [EQU] E′                 + 1 \n            T′ → ε                 1
$ eof [EQU] E′ [ADD] T ADD     + 1 \n            E′ → ADD T [ADD] E′    1
$ eof [EQU] E′ [ADD] T         1 \n              match                  1
$ eof [EQU] E′ [ADD] T′ F      1 \n              T → F T′               1
$ eof [EQU] E′ [ADD] T′ int    1 \n              F → int                1
$ eof [EQU] E′ [ADD] T′        \n                match and add          1 1
$ eof [EQU] E′ [ADD]           \n                T′ → ε                 1 1
$ eof [EQU] E′                 \n                add                    1 1 [ADD]
$ eof [EQU]                    \n                E′ → ε                 1 1 [ADD]
$ eof                          \n                add                    1 1 [ADD] [EQU]
$                                                match                  1 1 [ADD] [EQU]

The result, 1 1 [ADD] [EQU], is in postfix notation. The [EQU] function eventually assigns the result to a register in memory at execution time, which is why it is the last action. The [ADD] action adds up the two values in front of it, being 1 and 1. Further use of these actions as functions is described in Section 4.4.
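The stack-based parse with embedded actions can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the solution of the assignment: tokens are assumed to be (lexeme, identifier) tuples, the productions for F (int, id and a bracketed expression) are assumed because only part of the grammar is shown in Figure 4.7, and the hypothetical `select` function stands in for the parse table by looking at the first symbol of each alternative. Following the description above, the sketch also appends the assignment target to the postfix array, which Table 4.4 does not list.

```python
# Grammar with embedded semantic actions (symbols in square brackets).
GRAMMAR = {
    "S": [["A", "EOF"]],
    "A": [["ID", "EQU", "E", "[EQU]"]],
    "E": [["T", "E'"]],
    "E'": [["ADD", "T", "[ADD]", "E'"], ["SUB", "T", "[SUB]", "E'"], []],
    "T": [["F", "T'"]],
    "T'": [["MUL", "F", "[MUL]", "T'"], ["DIV", "F", "[DIV]", "T'"], []],
    "F": [["INT"], ["ID"], ["LBT", "E", "RBT"]],  # assumed alternatives
}


def select(nonterminal, lookahead):
    """Pick a production; a simplification that replaces the LL(1) parse table."""
    alternatives = GRAMMAR[nonterminal]
    if len(alternatives) == 1:
        return alternatives[0]
    for rhs in alternatives:
        if rhs and rhs[0] == lookahead:
            return rhs
    if [] in alternatives:
        return []  # epsilon production
    raise SyntaxError("unexpected %s while expanding %s" % (lookahead, nonterminal))


def parse(tokens):
    """Parse one tokenized line and return its postfix code."""
    stack = ["$", "S"]
    postfix = []
    pos = 0
    while stack:
        top = stack.pop()
        if top == "$":
            if pos != len(tokens):
                raise SyntaxError("input left after the end of the expression")
        elif top.startswith("["):
            postfix.append(top)  # semantic action
        elif top in GRAMMAR:
            lookahead = tokens[pos][1] if pos < len(tokens) else None
            stack.extend(reversed(select(top, lookahead)))
        else:
            # Terminal: it has to match the current token.
            if pos >= len(tokens) or tokens[pos][1] != top:
                raise SyntaxError("expected %s at token %d" % (top, pos))
            if top in ("ID", "INT"):
                postfix.append(tokens[pos][0])  # variables and values
            pos += 1
    return postfix


line = [("Var", "ID"), ("=", "EQU"), ("1", "INT"), ("+", "ADD"),
        ("1", "INT"), ("\n", "EOF")]
print(parse(line))  # ['Var', '1', '1', '[ADD]', '[EQU]']
```

Because the right-hand side of a rule is pushed in reverse, the leftmost symbol ends up on top of the stack, and each action is emitted only after both of its operands have been matched.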

4.3.5 Assembly and linking

The intermediate code has to be translated to a machine-specific language. The machine is discussed in Section 4.4. The machine uses a unary representation on which to perform arithmetic operations. So essentially the machine code generation is the conversion of the decimal representation of numbers in the postfix notation to a unary representation. By convention, the elements of the unary code are separated by a 0. For example, 2 3 in postfix becomes 110111 in unary. After this, the functions in the postfix code operate on the unary numbers.
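The conversion can be sketched as a small function. The name `to_unary` is hypothetical; the representation of the value 0 as a single 0 follows the description of the Value procedure in Section 4.4.

```python
def to_unary(values):
    """Encode natural numbers in unary, separated by a single 0."""
    cells = []
    for n in values:
        # n ones for a positive number, a single 0 for the value zero.
        cells.append("1" * n if n > 0 else "0")
    return "0".join(cells)


print(to_unary([2, 3]))  # 110111
print(to_unary([2, 0]))  # 1100
```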

4.4 Executing

A Turing Machine (TM) is used for the final execution of the code. Like the PDA, the TM consists of a memory and an underlying automaton. However, the memory takes the form of a tape: an infinite tape with cells that are empty or can hold any symbol, and with a $ sign (⊢ in most literature, but harder to use on machines) indicating the start of the tape. For the purpose of this project, using only 1's and 0's suffices. The tape is filled with 0's and the 1's represent the unary numbers.
A TM features a memory and a transition table, referred to as the instruction set, which holds the underlying automaton. A tape head above the tape can move right (R), move left (L) or stay (N) where it is (other literature may have a moving tape instead). When a symbol is read from the tape, the TM finds the instruction at the column corresponding to the symbol and the row corresponding to the state. Each instruction is a triple: the symbol to write, the direction to move and the next state. The idea of performing arithmetic operations came from [2] and is adopted in the implementation with some alterations. An example of how to perform arithmetic operations on a tape using a Turing machine is demonstrated in Table 4.5, and the instruction set of the ADD operation is found in Table 4.6. The marked digit indicates the location of the tape head.


Table 4.5: The ADD operation on a tape containing 110111.

Tape             State   Rule
1 1 0 1 1 [1]    0       (0, L, 1)
1 1 0 1 [1] 0    1       (1, L, 1)
1 1 0 [1] 1 0    1       (1, L, 1)
1 1 [0] 1 1 0    1       (0, L, 2)
1 [1] 0 1 1 0    2       (1, R, 7)
1 1 [0] 1 1 0    7       (1, R, 7)
1 1 1 [1] 1 0    7       (1, R, 8)
1 1 1 1 [1] 0    8       (1, R, 8)
1 1 1 1 1 [0]    8       (0, L, 9)
1 1 1 1 [1] 0    9       HALT

Table 4.6: The instruction set of an ADD operation

State   0          1
0       (0,L,3)    (0,L,1)
1       (0,L,2)    (1,L,1)
2       (1,R,4)    (1,R,7)
3       (0,L,9)    (None)
4       (1,R,5)    (None)
5       (0,L,6)    (1,R,5)
6       (None)     (0,L,9)
7       (1,R,7)    (1,R,8)
8       (0,L,9)    (1,R,8)
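The execution in Table 4.5 can be reproduced with a short simulation of the instruction set in Table 4.6. The sketch below is only an illustration: the function name `run` is hypothetical, the tape is a plain Python list without the $ marker, and state 9 is treated as the halting state.

```python
# Instruction set of Table 4.6: state -> symbol read -> (write, move, next state).
ADD = {
    0: {0: (0, "L", 3), 1: (0, "L", 1)},
    1: {0: (0, "L", 2), 1: (1, "L", 1)},
    2: {0: (1, "R", 4), 1: (1, "R", 7)},
    3: {0: (0, "L", 9), 1: None},
    4: {0: (1, "R", 5), 1: None},
    5: {0: (0, "L", 6), 1: (1, "R", 5)},
    6: {0: None, 1: (0, "L", 9)},
    7: {0: (1, "R", 7), 1: (1, "R", 8)},
    8: {0: (0, "L", 9), 1: (1, "R", 8)},
}
HALT = 9
STEP = {"L": -1, "R": 1, "N": 0}


def run(instructions, tape, head, state=0):
    """Execute an instruction set and return the final tape and head position."""
    tape = list(tape)
    while state != HALT:
        if head < 0:
            raise RuntimeError("tape head moved past the start of the tape")
        if head >= len(tape):
            # Cells that were never written hold the default digit 0.
            tape.extend([0] * (head + 1 - len(tape)))
        rule = instructions[state][tape[head]]
        if rule is None:
            raise RuntimeError("no instruction for state %d" % state)
        write, move, state = rule
        tape[head] = write
        head += STEP[move]
    return tape, head


# 2 + 3: the head starts on the last digit of the last value.
print(run(ADD, [1, 1, 0, 1, 1, 1], 5))  # ([1, 1, 1, 1, 1, 0], 4)
```

The same function handles the zero cases: with the value 0 written as a single 0, the tape for 0 + 3 is `[0, 0, 1, 1, 1]` and the run ends with `[1, 1, 1, 0, 0]`, the head again on the last digit of the result.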

The main difference between an ordinary TM and the one in this project is the presence of an additional memory. With an ordinary TM, the entire input stream is written to the tape and operated on. In this implementation, however, the input stream is read and written token by token. Each of the four identifiers in Table 4.3 has its own procedure:

Variable A variable read from the stream is looked up in memory. If it occurs in memory, the value assigned to it is written on the tape; if there is no occurrence, it goes to a pending state, waiting for an assignment from the [EQU] procedure.

Value Whenever a value is read, it is converted to a unary representation and written onto the tape. If the value is a 0, the TM writes a 0 too (also with a separating 0 in front). The tape after writing a 2 and a 0 looks as follows: 1 1 0 0.

Operator The corresponding instructions are read from the instruction set and executed on the contents of the tape, being the last two values.

Assignment The assignment procedure is executed at the end of each expression. The tape contains only one value, which is the solution to the expression. The TM reads the unary number and assigns it to the pending variable in memory.

Note that after any procedure, the tape head should return to the last digit of the last value on the tape.

4.4.1 The instruction set

In this section some more details on the instruction set are covered. The ADD operation in Table 4.6 may seem complicated and hard to read. The underlying automaton, shown in Figure 4.8, gives insight and improves readability. Each transition is labelled with a digit on the arrow, which represents the character read by the tape head. The attached text indicates which symbol to write and where to move next, respectively. The (None) instructions from the ADD operation are not present in the automaton, since there is no transition for these combinations. The (None) instructions exist because the symbol in question cannot occur at that stage of the execution.
Not every rule of the ADD operation in Table 4.6 is used in the execution in Table 4.5. The unused rules handle exceptions to the ordinary addition, in which both operands are greater than zero. For example, reading a 0 in the first rule (state 0) leads to the part where the tape head skips to the first operand (state 3). The tape head ends in state 9 on the first operand by skipping the 0 separating the numbers. Another example is state 2, where the tape head is at the first operand. If that operand is 0, it gets replaced by a 1 in the transition to state 4, and the separating 0 gets replaced by a 1 in the transition to state 5. Eventually one surplus 1 gets removed again (state 6 to state 9).
One other character that might be read is the $ sign. This happens when the tape head reaches the start of the tape. However, if the instructions are properly implemented, this does not occur.

Figure 4.8: The ADD operation in automata form.

4.4.2 Objectives

The student should be able to create a running TM that reads the input source code, reads the tape and performs operations on the contents of the tape. For each procedure read from the source, an action is performed. The actions are implemented by the student as well.
Besides the provided ADD instruction set, the student has to deliver the sets for the [SUB], [MUL] and [DIV] procedures. During this process the acquired knowledge is put into practice and clever thinking is expected. Operations involving a 0, like 3 ∗ 0 but also 0 ∗ 3, need to be handled and must produce a correct result. In addition, the work space for these operations is on the infinite side of the tape, since a straightforward solution could overwrite other contents of the tape.


CHAPTER 5

Implementation

In this chapter, the eventual implementation is discussed. Note that, because the purpose of this project is to create assignments, some parts are skipped or explained in less detail, so that no solutions to the assignments are given.
Each program has its own I/O management for reading and writing the source code and intermediate code, along with appropriate error messages. The arithmetic expression source code is provided in a text file and is read line by line. The intermediate codes are stored as objects using pickle¹.

5.1 Assignment 1

Assignment 1 consists of two programs: lexer.py and parser.py. An additional program, automata.py, provides an FA through the call FA(Q, Sigma, delta, q0, F), which mirrors the formal definition of an FA as a 5-tuple. The source code is stored in a text file called main.prog and is used by the lexer.
The FA object in automata.py has some basic methods:

Initialise Before an object is created, all arguments provided by the FA() call are checked for consistency. The q0 state and the accepting states F have to be part of the set of states Q, and each transition symbol has to be part of the alphabet Σ. If all arguments are correctly specified, the states are created. Each state has a name, a transition table, a boolean indicating whether it is a start state and a boolean indicating whether it is an accepting state. Then the FA object is returned.

Move The move method takes an input symbol and checks whether a transition from the current state exists for that symbol. If a transition exists, the state changes according to the transition and the method returns True. If there is no transition for the input symbol, False is returned.

Reset The reset method resets the current state to the start state q0.

Plot Using graphviz², a plot of the automaton is rendered and stored on the machine. This is useful when checking the correctness of the automaton.
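The initialise, move and reset methods described above could look roughly like the following sketch. It is not the automata.py of the framework: the states are kept in the transition dictionary instead of in separate state objects, the `accepting` method is a hypothetical addition, and the plot method is omitted.

```python
class FA(object):
    """Minimal finite automaton built from the 5-tuple (Q, Sigma, delta, q0, F)."""

    def __init__(self, Q, Sigma, delta, q0, F):
        # Consistency checks on the arguments.
        if q0 not in Q:
            raise ValueError("start state is not in Q")
        if not set(F) <= set(Q):
            raise ValueError("accepting states are not all in Q")
        for state, transitions in delta.items():
            if state not in Q:
                raise ValueError("unknown state %r in delta" % (state,))
            for symbol, target in transitions.items():
                if symbol not in Sigma or target not in Q:
                    raise ValueError("invalid transition from %r" % (state,))
        self.delta = delta
        self.q0 = q0
        self.F = set(F)
        self.current = q0

    def move(self, symbol):
        """Follow the transition for symbol; return False if there is none."""
        target = self.delta.get(self.current, {}).get(symbol)
        if target is None:
            return False
        self.current = target
        return True

    def reset(self):
        """Return to the start state q0."""
        self.current = self.q0

    def accepting(self):
        """Hypothetical helper: is the current state an accepting state?"""
        return self.current in self.F
```

A two-state automaton that accepts one or more digits can then be created with `FA(['START', 'INT'], ['<digit>'], {'START': {'<digit>': 'INT'}, 'INT': {'<digit>': 'INT'}}, 'START', ['INT'])`.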

5.1.1 Lexer

Once main.prog is loaded, it is divided into lines. The main function calls the lexer, which is built on an FA object. The assignment is to write a lexer(M, source) function that takes the FA object and a line of source code as input. The function returns a list of tuples with tokens and their corresponding identifiers, as explained in Section 4.1.
To create the lexer, the arguments for the FA object need to be specified as shown in Listing 5.1.

¹ https://docs.Python.org/2/library/pickle.html
² http://www.graphviz.org


The listing creates the DFA from Figure 4.4. The last line creates the FA object and thus the lexer.

Listing 5.1: FA arguments for creating the lexer automaton

    # FA is the class provided in automata.py.
    Q = ['START', 'ID', 'INT', 'ADD', 'SUB', 'MUL', 'DIV', 'LBT', 'RBT', 'EQU',
         'WHITE', 'EOF']
    Sigma = ['<digit>', '<character>', '/', '=', '(', ')', '+', '*', '-', '\n',
             ' ', '\t']
    # Transition table: state -> input symbol -> next state.
    delta = {
        'START': {
            '<character>': 'ID',
            '<digit>': 'INT',
            '+': 'ADD',
            '-': 'SUB',
            '*': 'MUL',
            '/': 'DIV',
            '(': 'LBT',
            ')': 'RBT',
            '=': 'EQU',
            '\n': 'EOF',
            ' ': 'WHITE',
            '\t': 'WHITE'},
        'ID': {
            '<character>': 'ID',
            '<digit>': 'ID'},
        'INT': {
            '<digit>': 'INT'}
    }
    q0 = 'START'
    # Every state except START is an accepting state.
    F = ['ID', 'INT', 'ADD', 'SUB', 'MUL', 'DIV', 'LBT', 'RBT', 'EQU', 'WHITE',
         'EOF']

    M = FA(Q, Sigma, delta, q0, F)

Once every token has its identifier and no errors have occurred, the tokens are stored in a 2D array (tokens by lines). Once all lines are tokenized, the 2D array is stored with pickle as main.lex.
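How a transition table like the one in Listing 5.1 turns a line into (token, identifier) tuples can be sketched as follows. This is an illustration rather than the intended solution: the function works directly on the transition dictionary instead of on the FA object, the helper `symbol_class` is hypothetical, and it is assumed that whitespace tokens are dropped and that every state other than START is accepting.

```python
DELTA = {
    "START": {"<character>": "ID", "<digit>": "INT", "+": "ADD", "-": "SUB",
              "*": "MUL", "/": "DIV", "(": "LBT", ")": "RBT", "=": "EQU",
              "\n": "EOF", " ": "WHITE", "\t": "WHITE"},
    "ID": {"<character>": "ID", "<digit>": "ID"},
    "INT": {"<digit>": "INT"},
}


def symbol_class(char):
    """Map a character to the alphabet symbol used in the transition table."""
    if char.isdigit():
        return "<digit>"
    if char.isalpha():
        return "<character>"
    return char


def lexer(delta, source):
    """Split one source line into (token, identifier) tuples."""
    tokens = []
    pos = 0
    while pos < len(source):
        state, start = "START", pos
        # Follow transitions for as long as possible (longest match).
        while pos < len(source):
            target = delta.get(state, {}).get(symbol_class(source[pos]))
            if target is None:
                break
            state = target
            pos += 1
        if state == "START":
            raise SyntaxError("unexpected character %r at position %d"
                              % (source[pos], pos))
        if state != "WHITE":
            tokens.append((source[start:pos], state))
    return tokens


print(lexer(DELTA, "Var = 1 + 1\n"))
```

The state in which the automaton stops is used directly as the identifier of the token, which is why the state names in Listing 5.1 coincide with the token identifiers.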

5.1.2 Parser

As explained in the first part of Section 4.2, a DFA can be used for parsing. The parser uses the same FA object as the lexer, and the arguments are specified in the same fashion as in Listing 5.1, following the structure of the DFA in Figure 4.5. The assignment is to take the input tokens from the main.lex file and parse them using the parser, created by calling FA() with the 5-tuple. Again the main function reads the file line by line and calls the parser. The student should implement the parser function, which follows the transitions based on the input token. The student also implements an error handler with a message corresponding to the state. If there is no possible transition from state 2, the error message might be: 'SyntaxError: Expected INT or ID'. The program only returns a message that the file was successfully parsed. No parsed file is produced, due to the balanced parentheses problem and the lack of operator precedence, discussed in Section 4.2.
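Since each state reveals which symbols are expected, an error message can be derived from the transition table itself. The following sketch assumes a transition dictionary of the same shape as in Listing 5.1; the function name `expected_message` and the states in the example are hypothetical, because the DFA of Figure 4.5 is not reproduced here.

```python
def expected_message(delta, state):
    """Build an error message from the symbols that have a transition in state."""
    expected = sorted(delta.get(state, {}))
    if not expected:
        return "SyntaxError: no further input expected"
    return "SyntaxError: Expected " + " or ".join(expected)


# Hypothetical fragment of a parser DFA: state 2 accepts an INT or an ID.
parser_delta = {2: {"INT": 3, "ID": 3}}
print(expected_message(parser_delta, 2))  # SyntaxError: Expected ID or INT
```

The symbols are sorted so that the message does not depend on the order in which the dictionary was filled.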

5.2 Assignment 2

For assignment 2, the file PDA.py has to be completed in order to parse the main.lex file properly. In addition, a CFG object is provided in CFG.py. The object reads the CFG provided by the user and creates a rule object for every left-hand side symbol. The readGrammar() method takes the grammar and checks its validity. The left-hand side is separated from the right-hand side and stored, together with the rule number, in a rule object.


In the PDA.py file the student has to implement a stack object and a parse function. The stack object should have an array to hold the symbols and two basic methods to push symbols onto and pop symbols off the stack. Further method implementations are up to the student.
The parse function takes the tuple array from main.lex as input and starts parsing. Together with the CFG object and the created stack object, symbols can be pushed and popped by following the rules of the grammar. The student should also handle the postfix creation, described in Section 4.3.4, by using a provided type function which models Table 4.3. Based on the returned value, the student should decide which actions to take next.
After parsing the file, the resulting postfix code is stored as main.exe.
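A stack object with the two basic methods could be as small as the sketch below. The class layout and the additional `push_rule` and `top` methods are hypothetical; they only illustrate the kind of further methods a student might add.

```python
class Stack(object):
    """Stack of grammar symbols backed by a Python list."""

    def __init__(self, symbols=None):
        self.items = list(symbols) if symbols else []

    def push(self, symbol):
        self.items.append(symbol)

    def pop(self):
        if not self.items:
            raise IndexError("pop from an empty stack")
        return self.items.pop()

    def push_rule(self, rhs):
        """Push a right-hand side so that its leftmost symbol ends up on top."""
        for symbol in reversed(rhs):
            self.push(symbol)

    def top(self):
        return self.items[-1] if self.items else None

    def __len__(self):
        return len(self.items)
```

Pushing the right-hand side in reverse keeps the leftmost symbol of a rule on top, which is the order in which a top-down parser has to process it.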

5.3 Assignment 3

The last assignment consists of two files: TM.py and EXE.py. The TM.py file contains a TM object with functions and methods that have to be implemented by the student. EXE.py is the main program, which executes the main.exe file from the parse phase. Again the file is fed line by line to the TM object. The instruction sets are specified in the TM.py file in the executables function. This function returns the instruction set for the inputs [ADD], [SUB], [MUL] and [DIV].
In order to create a functioning TM, the TM object has a register holding the executable code, a (RAM) memory to store variables and a tape. The register is a list and the memory uses a dictionary for easy access. A tape object represents the infinite³ tape. Listing 5.2 shows the tape object in code.

Listing 5.2: The infinite tape object

    class Tape(object):
        def __init__(self, init):
            # Cells that were never written read as the default digit 0.
            self.default = 0
            # The first cell holds the start-of-tape marker passed in as init.
            self.items = [init]

        def get(self, index):
            # Reading beyond the stored cells yields the default digit.
            if index < len(self.items):
                return self.items[index]
            else:
                return self.default

        def set(self, index, value):
            # Grow the list with default digits up to index before writing.
            if index >= len(self.items):
                self.items.extend([self.default] * (index + 1 - len(self)))
            self.items[index] = value

        def __len__(self):
            return len(self.items)

        def __repr__(self):
            return ' '.join(map(str, self.items))

The __init__ method initialises the tape by creating a list and setting the first element, the $ sign in this case. The default element is the 0 digit since, as mentioned in Section 4.4, the tape is filled with 0's. To realise the infinite tape, the get and set methods treat indexes outside the range of the items list differently. When the index exceeds the range, the get method returns the default digit (0). For an index outside the range, the set method first extends the items list with the default digit and then writes the value.
The student has to implement a run method inside the TM object that handles the input from the register according to the actions described in Section 4.4. When an operator procedure is read from the register, an execute function is called, which has to be completed by the student.

³ Limited by the available hardware memory and depending on the machine.


The function should execute the instruction set for each procedure. By reading the symbol below the tape head, the function takes the action from the instruction set corresponding to the symbol and the current state. The action is then performed on the tape; it comprises a write, a move and a state change.
Finally, the instruction sets for the [SUB], [MUL] and [DIV] procedures are implemented. The challenge here is handling special cases, for example a 0 operand. If all works correctly, the final content of the tape is the solution to the arithmetic expression.


CHAPTER 6

Conclusion

This paper described how the practical use of AFL can be found in compiling and executing code, in this case simple arithmetic expressions in the form of assignments. All three components mentioned in Section 1.2, Finite Automata, Push-Down Automata and the Turing Machine, are covered in the study. The fundamental elements of each automaton are recreated and applied in the compiler field, using knowledge the student has already acquired. This is done without overwhelming the student with new information, by restricting the implementation to AFL elements only while still giving an insight into the compiler design field. The study shows that, with some abstraction and preparatory work, compiling and executing are suitable for demonstrating the applicability of AFL.
There are no results of the assignments in practice. However, the assignments closely follow the literature on AFL and are an extension of the subject rather than an addition to it. The assignments are therefore expected to be feasible for students and convenient for teachers.
The use of Python worked well. Note that it was the only language considered, and that it was not the goal of this study to find the optimal language for AFL subjects. The assignments can be implemented in any high-level programming language, each with its own advantages and disadvantages.

6.1 Reflection

Although the study worked out as intended, some flaws surfaced. The whole process of compiling a simple arithmetic expression to an eventual solution seems redundant, since there are other ways of solving this problem. Using a programming language instead of the expressions would have resulted in a more challenging assignment, but one that is harder to apply to the TM. Adding variables to the expressions was a balance between these two.
The constraints of the tape both obstructed and advanced the implementation. Since the tape resembles computer memory, only two symbols are allowed. This makes number representation more complex, which led to using natural numbers only, which in turn restricts the possible expressions. However, the advantage was the simplification of the instruction sets that operate on the tape, since there is no need to handle negative numbers and floating-point numbers.
In retrospect, the assignments give a basic understanding of how compilers might work. The approach is very powerful once the automaton and grammar are correctly defined: the code does not have to change when another language is specified, only the automaton and grammar need to be redefined. Keep in mind, though, that some structures, like the parse table, strongly depend on these automata and grammars. However, by following the steps in the mentioned literature, these structures can easily be created, possibly with the help of specific tools on the web¹.

¹ http://hackingoff.com/compilers/ll-1-parser-generator


The execution on the TM successfully resulted in the solution to the expression. The approach gives an insight into how a processor might perform arithmetic operations on a data stream and shows the complexity of the operations. With existing hardware, the execution approach might seem redundant. But again, keep in mind that it illustrates the capabilities of the TM.


Bibliography

[1] Anderson, J. A. (2006). Automata theory with modern applications. Cambridge University Press.

[2] Cooper, C. (2013). Languages and machines. Notes for Math, 237.

[3] DuSell, B. (2014). pycfg. https://github.com/bdusell/pycfg/. Retrieved May, 2017.

[4] Han, Z. D., Kocijan, K., and Lopina, V. (2016). Python as pseudo language for formal language theory. In MIPRO 2016-Computers in education (CE).

[5] Kozen, D. C. (2012). Automata and computability. Springer Science & Business Media.

[6] Mogensen, T. Æ. (2009). Basics of compiler design. Torben Ægidius Mogensen.

[7] Perrin, D. (2003). Automata and formal languages. Université de Marne-la-Vallée, pages 1–22.

[8] Webber, A. B. (2008). Formal language: A practical introduction. Franklin, Beedle & Associates Inc.

[9] Wu, F. (2012). Syntax-directed translation. http://slideplayer.com/slide/8701450/. Retrieved May, 2017.
