Parsing by Example
Workshop at TU Delft on Dec 13, 2011 (transcript)
Oscar Nierstrasz, Software Composition Group
scg.unibe.ch
Parsing by Example
Monday, 12 December 11
5
Moose is a platform for software and data analysis
www.moosetechnology.org
6
[Architecture diagram: a Model repository with an Extensible meta model, implemented in Smalltalk. Source languages (Smalltalk, Java, C++, COBOL, …) are imported via CDIF, XMI, or an external parser. Services: Navigation, Metrics, Querying, Grouping. Tools on top: CodeCrawler, ConAn, Van, Hapax, …]
The Story of Moose, ESEC/FSE 2005
7
System complexity
8
Clone evolution
9
Class blueprint
10
Topic correlation matrix
11
Distribution map (topics spread over classes in packages)
12
Hierarchy evolution view
13
Ownership map
15
Mondrian: An Agile Visualization Framework, SoftVis 2006
17
[Same architecture diagram as on slide 6.]
But we have a huge bottleneck for new languages ...
25
Grammar Stealing
27
… stance). Figure 2 shows the first two cases; the third is just a combination. If you start with a hard-coded grammar, you must reverse-engineer it from the handwritten code. Fortunately, the comments of such code often include BNF rules (Backus Naur Forms) indicating what the grammar comprises. Moreover, because compiler construction is well understood (there is a known reference architecture), compilers are often implemented with well-known implementation algorithms, such as a recursive descent algorithm. So, the quality of a hard-coded parser implementation is usually good, in which case you can easily recover the grammar from the code, the comments, or both. Except in one case, the Perl language [14], the quality of the code we worked with was always sufficient to recover the grammar.
If the parser is not hard-coded, it is generated (the BNF branch in Figure 2), and some BNF description of it must be in the compiler source code. So, with a simple tool that parses the BNF itself, we can parse the BNF of the language that resides in the compiler in BNF notation, and then extract it.
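A BNF-stealing tool of the kind described can be sketched in a few lines. The Yacc-style rule shape and the function name here are illustrative assumptions, not taken from the article:

```python
import re

def extract_bnf(compiler_source: str) -> dict:
    """Collect BNF-style productions of the form `lhs : alternatives ;`
    from Yacc-like compiler sources. A toy sketch of a BNF-stealing
    tool; real grammar recovery must also cope with semantic actions,
    comments and vendor dialects."""
    rules = {}
    for m in re.finditer(r"(\w+)\s*:\s*([^;]+);", compiler_source):
        lhs, rhs = m.group(1), m.group(2)
        # split alternatives on '|', then each alternative into symbols
        rules[lhs] = [alt.split() for alt in rhs.split("|")]
    return rules

grammar = extract_bnf("""
expr   : expr '+' term | term ;
term   : term '*' factor | factor ;
factor : NUMBER | '(' expr ')' ;
""")
```

Real recovery tools do far more work, but the core idea is exactly this: the grammar already sits in the compiler sources, waiting to be parsed out.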
When the compiler source code is not accessible (we enter the Language Reference Manual diamond in Figure 2), either a reference manual exists or not. If it is available, it could be either a compiler vendor manual or an official language standard. The language is explained either by example, through general rules, or by both approaches. If a manual uses general rules, its quality is generally not good: reference manuals and language standards are full of errors. It is our experience that the myriad errors are repairable. As an aside, we once failed to recover a grammar from the manual of a proprietary language for which the compiler source code was also available (so this case is covered in the upper half of Figure 2). As you can see in the coverage diagram, we have not found low-quality language reference manuals containing general rules for cases where we did not have access to the source code. That is, to be successful, compiler vendors must provide accurate and complete documentation, even though they do not give away their compilers' source code for economic reasons. We discovered that the quality of those manuals is good enough to recover the grammar. This applies not only to compiler-vendor manuals but also to all kinds of de facto and official language standards.
Unusual languages rarely have high-quality manuals: either none exists (for example, if the language is proprietary) or the company has only a few customers. In the proprietary case, a company is using its in-house language and so has access to the source code; in the other case, outsiders can buy the code because its business value is not too high. For instance, when Wang went bankrupt, its …
IEEE Software, November/December 2001, p. 82
[Figure 2. Coverage diagram for grammar stealing: a decision diagram from Start through "Compiler sources?" (hard-coded parser vs. BNF) and "Language reference manual?" (general rules vs. constructions by example), with quality checks leading to "Recover the grammar"; annotated cases include "One case: perl", "One case: RPG", and "No cases known".]
Still takes a couple of weeks and lots of expertise
28
Recycling Trees
Daniel Langone. Recycling Trees: Mapping Eclipse ASTs to Moose Models. Bachelor's thesis, University of Bern.
… of this phase will be a model of the Ruby software system. As the meta-model is FAME compliant, the model will be as well. Information about the ClassLoader, an instance responsible for loading Java classes, is covered in section 4.7.
The Fame framework automatically extracts a model from an instance of an Eclipse AST. This instance corresponds to the instance of the Ruby plugin AST representing the software system. Automation is possible due to the fact that we defined the higher level mapping. Figure 2.1 reveals the need for the higher mapping to be restored. In order to implement the next phase independently from the environment used in this phase, we extracted the model into an MSE file.
Figure 2.1: The dotted lines correspond to the extraction of a (meta-)model. The other arrows between the model and the software system hierarchy show which Java tower level corresponds to which meta-model tower element.
2.3 Model Mapping by Example phase
Our previously extracted model still contains platform-dependent information and thus is not a domain-specific model for reverse engineering. It could be used by very specific or very generic reverse engineering tools, as it contains only the concrete syntax tree of the software system. However, such tools do not exist. In the Model Mapping by Example phase we want to transform the model into a FAMIX-compliant one. In such a format it will be easier to use in several software engineering tools.
The idea behind this approach relies on Parsing by Example [3]. Parsing by Example presents a semi-automatic way of mapping source code to domain …
29
1. Infer AST implementation from IDE plugin
2. Extract metamodel from plugin
3. Map model elements to FAMIX (Moose)
30
Cool idea, but hard to make it work in practice
31
Parsing by Example
Example-Driven Reconstruction of Software Models, CSMR 2007
32
[Workflow diagram: (1) import source code into a model; (2) specify example mappings; (3) infer a grammar from the examples; (4) generate a parser; (5) parse the remaining source code; (6) export the resulting model.]
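Step 3, "infer", is the interesting one. As a deliberately naive stand-in for what the actual inference does, one can treat an example mapping as a source snippet plus the parts to be generalized, and derive a pattern with named holes. All names below are invented for illustration:

```python
import re

def infer_pattern(example: str, holes: dict) -> re.Pattern:
    """Generalize an example snippet into a pattern.

    Each (placeholder_name -> literal) pair marks a part of the
    example that should be generalized to an identifier. This is an
    illustrative sketch, not the tool's actual inference algorithm.
    """
    pattern = re.escape(example)
    for name, literal in holes.items():
        # replace the concrete name by a named identifier hole
        pattern = pattern.replace(re.escape(literal), rf"(?P<{name}>\w+)")
    return re.compile(pattern)

# One example mapping: "class Foo < Bar" denotes a Class named Foo
# with superclass Bar.
pat = infer_pattern("class Foo < Bar", {"name": "Foo", "superclass": "Bar"})
m = pat.match("class Account < Model")
```

The pattern generalized from the single example then matches other class definitions, binding `name` and `superclass` for the model-building step.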
33
CodeSnooper
35
Markus Kobel. Parsing by Example. MSc, U Bern, April 2005.
… result to that obtained with a robust parser we find that we only miss five classes:

                             Precise Model   Our Model
Number of Model Classes           366           361
Number of Abstract Classes        233           233
In a second iteration we give examples of methods in abstract and concrete classes as well as interfaces. This leads to three separate grammars which cannot easily be merged, since this would lead to an ambiguous grammar [12]. Instead we generate three parsers and apply them in parallel to the source files. We now obtain the following results:

                             Precise Model   Our Model
Number of Model Classes           366           316
Number of Abstract Classes        233           233
Total Number of Methods          1887          1648
In addition to the two files we could not parse earlier, we now have some problems due to (i) attributes being confused with methods, (ii) language constructs (like static) occurring in unexpected contexts, (iii) different kinds of definitions of methods. Additional examples would help to solve these problems.
In a third iteration we add examples to recognize attributes. Once again we obtain three parsers based on three sets of examples for abstract classes, concrete classes and interfaces. We obtain the following results:
                             Precise Model   Our Model
Number of Model Classes           366           346
Number of Abstract Classes        233           230
Total Number of Methods          1887          1780
Total Number of Attributes        395           304
This process can be repeated to cover more and more of the subject language. The question of when to stop can be answered with "When the results are good enough". Good enough in this context means when we have enough information for a specific reverse engineering task. For example, a "System Complexity View" [18] is a visualization used to obtain an initial impression of a legacy software system. To generate such a view we need to parse a significant number of the classes, identify subclass relations, and establish the numbers of methods and attributes of each class. Even if we parse only 80% of the code, we can still get an initial impression of the state of the system. If on the other hand we wanted to display a "Class Blueprint" [17], a semantically enriched visualization of the internal structure of classes, we would need a refined grammar to extract more information. The "good enough" is thus given by the reverse engineering goals, which vary from case to case.
4.2 Ruby
As a second case study we chose the language Ruby, because it is quite different from Java and it has a non-trivial grammar. We took the unit testing library distributed with Ruby version 1.8.2, released at the end of 2004. This part of the library contains 22 files written in Ruby. We do not have a precise parser for Ruby that can generate a FAMIX model (actually, to our knowledge, for Ruby there is only one precise parser, namely the Ruby interpreter itself). Instead we retrieve the reference model by inspecting the source code manually.
In Ruby there are Classes and Modules. Modules are collections of Methods and Constants. They cannot generate instances. However they can be mixed into Classes and other Modules. A Module cannot inherit from anything. Modules also have the function of Namespaces. Ruby does not support Abstract Classes [22].
For the definition of the scanner tokens for identifiers and comments we use the following regular expressions:

<IDENTIFIER> : [a-zA-Z$] \w* ( \? | \! )? ;
<comment>    : \# [^\r\n]* <eol> ;
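Transcribed into Python regex syntax (the exact identifier character set is read off the garbled excerpt above, so treat it as an approximation of the thesis's scanner tokens):

```python
import re

# Identifier: a letter or '$', then word characters, optionally ending
# in '?' or '!' (Ruby predicate and mutator method names).
IDENTIFIER = re.compile(r"[a-zA-Z$]\w*(\?|\!)?")

# Comment: '#' up to (but not including) the end of the line.
COMMENT = re.compile(r"#[^\r\n]*")
```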
Using just 2 examples each of namespaces, classes, methods and attributes, we are able to parse 7 of the 22 files.

                             Precise Model   7 files   Our Model
Number of Namespaces               8            6          6
Number of Model Classes           25            4          4
Total Number of Methods          247           26         26
Total Number of Attributes       136            9          9
Amongst the files we could not parse, there are 4 large files containing GUI code. If we ignore these files, we are able to detect about 25% of the target elements.
There are two main reasons that so few files can be successfully parsed:
1. The comment character # occurs frequently in strings and regular expressions, causing our simple-minded scanner to fail. A better scanner would fix this problem. With some simple preprocessing (removing any hash character that occurs inside a string and removing all comments) we can improve recall to 65-85%.
2. Ruby offers a very rich syntax for control constructs, allowing the same keywords to occur in many different positions and contexts. One would need many more examples to recognize these constructs.
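The preprocessing mentioned in point 1 can be sketched line by line. This version ignores escaped quotes and multi-line strings, which a real tool would have to handle:

```python
def preprocess(line: str) -> str:
    """Drop '#' characters inside string literals and strip trailing
    comments, so a simple scanner no longer trips over them."""
    out = []
    quote = None                      # the quote character currently open
    for ch in line:
        if quote:
            if ch == quote:
                quote = None          # the string literal ends here
            elif ch == "#":
                continue              # hash inside a string: drop it
        elif ch in "'\"":
            quote = ch                # a string literal starts
        elif ch == "#":
            break                     # a comment starts: discard the rest
        out.append(ch)
    return "".join(out)
```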
JBoss case study
Ruby case study
Problems:
• Ambiguity
• False positives
• False negatives
• Embedded languages
36
Evolutionary Grammar Generation
Sandro De Zanet. Grammar Generation with Genetic Programming — Evolutionary Grammar Generation.
MSc, U Bern, July 2009.
4.1. TUNING PEGS FOR GENETIC PROGRAMMING 27
[Figure 4.5: Insert a node]
4.1.3 Fitness function
The fitness function is the most important part of the evolutionary algorithm, indirectly defining the envisioned goal. It determines the quality of PEGs, which is a representation of the distance to the solution. Without an elaborate fitness function, the search will head in the wrong direction. There is also the risk of finding non-intended solutions or of being trapped in local extrema.
Normally the fitness function is defined as a number where a higher value stands for a better rating. We decided to define the fitness function the other way around: worse PEGs have a higher fitness number; a fitness of 0 defines the best achievable solution. This makes sense because we use the size metric of a PEG, which gets worse the bigger a PEG becomes. Furthermore, we have found a possible solution if all the characters of the source code have been parsed. Hence we measure the number of characters left to parse, which is better the smaller it is.
For our problem we first implemented the most straightforward solution. We let each PEG parse every source code file. The number of characters that it cannot parse is added to the final fitness. There is an additional penalty for not being able to parse a file at all (the parser fails on the first character). The reason behind this is that we wanted a stronger differentiation between the grammars that could at least parse one character and the completely useless grammars. In this way a better grammar is a grammar with a lower fitness, a grammar with fitness zero being one that can fully parse every source file. We don't allow negative fitness (the reason is explained later).
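That fitness can be sketched as follows; `peg.parse` is assumed to return the number of characters consumed before failing, and the penalty constant is illustrative, not the thesis's value:

```python
# Lower is better; 0 means every file parses completely.
FAIL_PENALTY = 100  # illustrative value

def fitness(peg, files):
    total = 0
    for source in files:
        consumed = peg.parse(source)        # characters successfully parsed
        total += len(source) - consumed     # unparsed characters
        if consumed == 0 and source:        # parser fails on first character
            total += FAIL_PENALTY           # completely useless grammar
    return total                            # never negative by construction
```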
37
18 CHAPTER 3. GENETIC PROGRAMMING
Since biological evolution starts from an existing population of species, we need to bootstrap an initial population before we can begin evolving it. This initial population is generally a number of random individuals. These initial individuals usually don't perform well, although some will already be a tad better than others. That is exactly what we need to get evolution going.
The final part is reproduction, i.e. generating a new generation from the surviving previous generation. For that purpose an evolutionary algorithm usually uses two types of genetic operators: point mutation and crossover (we will refer to point mutations as mutations, although crossover is technically also a mutation). Mutations change an individual in a random location to alter it slightly, thus generating new information. Crossover¹, however, takes at least two individuals and cuts out part of one of them to put it in the other individual(s). By only moving around information, crossover does not introduce new information. Be aware that every modification of an individual has to result in a new individual that is valid. Validity is very dependent on the search space; it generally means that the fitness function as well as the genetic operators should be applicable to a valid individual. A schematic view is shown in fig. 3.1.
[Figure 3.1: Principles of an Evolutionary Algorithm. Loop: generate a new random population → select the most fit individuals → generate a new population with genetic operators (mutation, crossover) → repeat until fit enough.]
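The loop of Figure 3.1, in schematic form. `random_individual`, `fitness`, `mutate` and `crossover` come from the problem domain (here, PEG construction); the population sizes and generation count are illustrative defaults, not the thesis's settings:

```python
import random

def evolve(random_individual, fitness, mutate, crossover,
           pop_size=20, survivors=5, generations=50, good_enough=0):
    # generate new random population
    population = [random_individual() for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=fitness)            # lower fitness is better
        if fitness(population[0]) <= good_enough:
            break                               # fit enough? -> done
        parents = population[:survivors]        # select most fit individuals
        children = []
        while len(children) < pop_size - len(parents):
            a, b = random.sample(parents, 2)    # crossover, then mutation
            children.append(mutate(crossover(a, b)))
        population = parents + children         # new generation
    return min(population, key=fitness)
```

The same skeleton can be exercised with a toy domain, such as integers evolving toward a target value.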
There are alternatives to rejecting a certain number of badly performing individuals per generation. To compute the new generation, one can generate new individuals from all individuals of the old generation. This would not result in an improvement, since the selection is completely random. Hence the parent individuals are selected …
¹ Crossover in biology is the process of two parental chromosomes exchanging parts of genes in the meiosis (cell division for reproduction cells).
26 CHAPTER 4. COMBINATION OF PEGS AND GENETIC PROGRAMMING
[Figure 4.4: Push a node up]
Insertion ensures diversity in graph depth. Graph depth is important for the emergence of more complex structures. Deletion, on the other hand, provides a counterweight to it. This ensures that grammars do not grow too big and that it is possible to simplify structures.
4.1.2 Crossover
Crossover is the second, less often used modification. The term is borrowed from the biological chromosomal crossover, where it describes the exchange of genes of two paired-up chromosomes (one from the mother and one from the father) in the process of meiosis¹.
Applied to grammars, this results in copying a subgraph of one grammar into another graph. If one has two grammars p1 and p2, a new grammar is created by selecting a random node of p1 and adding it to a random node of p2. Note that the parsers will not be directly linked. Rather, the subgraph of p1 is copied independently from its parent.
The motivation for crossover is to preserve useful, more complex structures that have already evolved. By combining two modestly performing parsers one can generate a new one that combines the useful parts of the parents into a parser that performs well in the two places where the parents did.
¹ The process of generating reproductive cells such as sperm and eggs.
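On a minimal PEG-like node type (an assumption for illustration; not the thesis code), the crossover just described looks like this:

```python
import copy
import random

class Node:
    """A minimal PEG-like grammar node: a kind plus child parsers."""
    def __init__(self, kind, children=None):
        self.kind = kind
        self.children = children or []

    def all_nodes(self):
        yield self
        for child in self.children:
            yield from child.all_nodes()

def crossover(p1, p2):
    # Copy p2 so both parents stay intact, then graft a deep copy of a
    # random subgraph of p1 onto a random node of the copy. The copied
    # subgraph is independent of p1, so the parsers are never linked.
    child = copy.deepcopy(p2)
    donor = copy.deepcopy(random.choice(list(p1.all_nodes())))
    target = random.choice(list(child.all_nodes()))
    target.children.append(donor)
    return child
```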
38
4.1. TUNING PEGS FOR GENETIC PROGRAMMING 25
[Figure 4.2: Add back link node]
[Figure 4.3: Delete a node]
To ensure the evolvability of more complex parsers we need more complex mutations. After the initial population was sorted, mostly only single-character parsers were left, and these couldn't mutate to parsers with more nodes.
The following mutations add the possibility to insert nodes between the current parser and the root parser:
deletion The selected parser first moves all its children to the parent parser, thus replacing itself by its children (fig. 4.4).
insertion The selected parser is replaced by a composite parser (sequence or choice). The selected parser is then added to the new parser. This results in the insertion of a new parser between the selected parser and its parent. If the selected parser is the root, the new parser becomes the new root (fig. 4.5).
PEG mutation and crossover
24 CHAPTER 4. COMBINATION OF PEGS AND GENETIC PROGRAMMING
To transform PEGs into evolvable structures, some adaptations have to be made. In the definition by Bryan Ford [7], sequence and choice operators only have two child parsers. Longer sequences (or choices) can be built by nesting them. To make profitable changes more probable and to keep the number of nodes down (which also increases the execution speed), they have been generalized to arbitrary length.
First we will look at the modifications that are possible for grammars and are commonly used in evolutionary algorithms: mutation and crossover.
4.1.1 Mutation
A mutation of a grammar is the modification of one of its randomly chosen nodes. Every parser node is different, and therefore it has to change itself in a different way according to its type. So for instance the character parser will mutate the character it parses. Similarly, a range parser alters its range.
There are, though, some more common mutations that affect the structure of the grammar and are not dependent on the node type. They only affect parser nodes that are neither primitive nor unary:
add child A new randomly generated parser will be inserted in the list of children (fig. 4.1).
[Figure 4.1: Add a node]
add link Works similarly to adding a child, although in this case the new parser is not randomly generated but selected from one of the nodes of the already existing PEG. This results in a link to this parser (like a CFG rule, fig. 4.2).
remove child A randomly selected child will be removed. No effect if there is only one child. Remark that we don't allow composite parsers with no children, since they don't constitute a valid grammar (fig. 4.3).
The mutation has no effect on unary or primitive parsers like the character parser.
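The three structural mutations can be sketched on a minimal PEG-like node type (again an illustrative assumption, not the thesis implementation):

```python
import random

class Node:
    """A minimal PEG-like grammar node: a kind plus child parsers."""
    def __init__(self, kind, children=None):
        self.kind = kind
        self.children = children or []

    def all_nodes(self):
        yield self
        for child in self.children:
            yield from child.all_nodes()

def add_child(composite, make_random_parser):
    # fig. 4.1: insert a freshly generated parser into the children
    composite.children.append(make_random_parser())

def add_link(composite, root):
    # fig. 4.2: not a fresh parser but a reference to an existing node
    # of the PEG, like a CFG rule (this may create a back link)
    composite.children.append(random.choice(list(root.all_nodes())))

def remove_child(composite):
    # fig. 4.3: remove a random child; no effect with a single child,
    # since a composite with no children is not a valid grammar
    if len(composite.children) > 1:
        composite.children.remove(random.choice(composite.children))
```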
39
Desired grammar:
([a-z] (‘_‘ | [0-9] | [a-z])*)
Found grammar:
(([a-z] ({‘\n‘ | ‘_‘ | [0-9]})*))*

Desired grammar:
0 -> (‘c‘ ‘a‘ ‘t‘ ‘:‘ ‘ ‘ ([a-z])+ 1 -> {2 -> (‘\n‘ 0) | e})
Found grammar:
(0 -> (‘c‘ ‘a‘ ‘t‘ ‘:‘ ‘ ‘ 2 -> (([a-z])+ ‘\n‘)))+

Desired grammar:
0 -> (‘+‘ | ‘-‘ | ‘<‘ | ‘>‘ | ‘,‘ | ‘.‘ | 1 -> (‘[‘ 2 -> (0)* ‘]‘))
Found grammar:
(0 -> {‘<‘ | ‘]‘ | ‘.‘ | ‘,‘ | ‘>‘ | ‘-‘ | ‘[‘ | ‘+‘})*
Slow and expensive. Modest results for complex languages.
40
What now?
43
Incrementally refine island grammars
Exploit similarities between languages
(adapt and compose)
Exploit indentation as a proxy for structure
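One concrete reading of "exploit indentation as a proxy for structure": recover a nesting tree from leading whitespace alone, with no knowledge of the language's grammar. A sketch; layout-sensitive parsing must also deal with tabs, continuation lines and braces:

```python
def indent_tree(source: str):
    """Build a (label, children) nesting tree from indentation alone."""
    root = ("<root>", [])
    stack = [(-1, root)]                  # (indent level, open node)
    for line in source.splitlines():
        if not line.strip():
            continue                      # skip blank lines
        indent = len(line) - len(line.lstrip())
        node = (line.strip(), [])
        while stack[-1][0] >= indent:     # dedent: close open scopes
            stack.pop()
        stack[-1][1][1].append(node)      # attach to enclosing scope
        stack.append((indent, node))
    return root

tree = indent_tree("""\
class Account
  def deposit(amount)
    @balance += amount
""")
```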
Ideas? ...
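And one concrete reading of "incrementally refine island grammars": match only the constructs of interest ("islands") precisely and skip everything else ("water") character by character, refining by adding island rules one at a time. The patterns below are illustrative, not a real Ruby grammar:

```python
import re

# Islands: the constructs we care about, matched precisely.
# Refinement means adding entries to this table one at a time.
ISLANDS = {
    "class":  re.compile(r"class\s+(\w+)"),
    "method": re.compile(r"def\s+(\w+)"),
}

def parse_islands(source: str):
    """Collect (kind, name) facts, skipping unmatched text as water."""
    found = []
    pos = 0
    while pos < len(source):
        for kind, pattern in ISLANDS.items():
            m = pattern.match(source, pos)
            if m:
                found.append((kind, m.group(1)))
                pos = m.end()
                break
        else:
            pos += 1          # water: skip a single character
    return found

facts = parse_islands("class Account; def deposit(x) ...; end; end")
# -> [("class", "Account"), ("method", "deposit")]
```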