


Modularization of Text-to-Model Mapping Specifications

A Feasibility Study Using Scannerless Parsing

Diploma Thesis at the Institute for Program Structures and Data Organization

Chair for Software Design and Quality
Prof. Dr. Ralf H. Reussner

Fakultät für Informatik
Karlsruhe Institute of Technology

by

cand. inform.
Martin Küster

Advisor:

Prof. Dr. Ralf H. Reussner
Dipl.-Inform. (FH) Thomas Goldschmidt

Date of Registration: 2009-04-20
Date of Submission: 2009-11-02

Chair for Software Design and Quality


I declare that I have developed and written the enclosed Diploma Thesis completely by myself, and have not used sources or means without declaration in the text.

Karlsruhe, 2009-11-02


Abstract

Domain-specific languages (DSLs) are developed for a specific concern and limited by nature. Ideally, DSLs and tools generated for them can be easily combined for reuse. When using textual concrete syntax for DSLs, the editing framework must be aware of the language composition. This involves modularizing the mapping between abstract and concrete syntax and combining the lexical and syntactic analyzers from both language toolkits. We present a generic generation of a scannerless parser that avoids lexical conflicts, especially keyword pollution. For that purpose, the existing textual modeling framework FURCAS employs the Rats! parser generator to create domain parsers. This serves to evaluate the feasibility of migrating to scannerless parsing in order to facilitate flexible language composites.

Zusammenfassung

Domain-specific languages (DSLs) are developed for a specific application area and are therefore limited by nature. Ideally, DSLs and the tools developed for them can easily be combined and reused. When textual syntaxes are used for DSLs, the editor framework must support the composite of the new languages. This requires means for modularizing text-to-model mapping descriptions as well as merging the two components for lexical and syntactic analysis. We present the generic generation of a scannerless parser that avoids lexical conflicts, in particular those caused by imported keywords. For this purpose, the existing textual editor framework FURCAS uses the Rats! parser generator to create domain parsers. This serves as a feasibility analysis of the migration to scannerless parsing in order to enable flexible language compositions.


Contents

1 Introduction
  1.1 Textual Modeling
  1.2 Setting
  1.3 The Language Composition Vision
  1.4 Goal
  1.5 Outline of the Thesis

2 Analysis
  2.1 Problem Statement
    2.1.1 Keyword Pollution and Scannerless Parsing
    2.1.2 Parsing Techniques and Grammar Classes
  2.2 TCS Mapping Language
    2.2.1 An Introductory Example
    2.2.2 tcs.ConcreteSyntax
    2.2.3 tcs.Template
    2.2.4 tcs.ClassTemplate
    2.2.5 tcs.Sequence and tcs.SequenceElement
    2.2.6 tcs.Property
    2.2.7 tcs.Operators and OperatorTemplate
    2.2.8 tcs.FunctionTemplate
    2.2.9 tcs.EnumerationTemplate
  2.3 Terminology
  2.4 Related Work


3 Design
  3.1 Overview
    3.1.1 Technological Overview
    3.1.2 Higher-Order Transformation Approach
    3.1.3 Bootstrapping the TCS
  3.2 TCS Modifications for Composition
  3.3 TCS-to-Grammar Transformation
    3.3.1 Concrete Syntax to Grammar
    3.3.2 Class Templates to Productions
    3.3.3 Operator Templates to Productions
    3.3.4 FunctionTemplate to Production
    3.3.5 EnumerationTemplate to Production
    3.3.6 tcs.Sequence to xtc.Sequence
    3.3.7 Keywords and Symbols to Productions

4 Implementation
  4.1 Handler-Based Transformation
  4.2 Packrat Parser Specifics
    4.2.1 Memoization
    4.2.2 Parser Optimizations
    4.2.3 Actions and Bindings
    4.2.4 Parameterized Rules
  4.3 Lightweight Nested Transactions
  4.4 Ambiguities
    4.4.1 Greedy Parse - Shift/Reduce Conflicts
    4.4.2 Ordering of Choices
    4.4.3 A Heuristic for Shadowed Alternatives
  4.5 Tokenization and Scannerless Parsing
    4.5.1 White Space Definition
    4.5.2 Assignment of Token Types
    4.5.3 Tokenizing via Lightweight Transactions
  4.6 Challenges
    4.6.1 Incrementality
    4.6.2 Error Handling
    4.6.3 Error Recovery
    4.6.4 From Embedding to Composition


5 Summary and Conclusions

Bibliography


List of Figures

1.1 Components of a CTS framework
2.1 Model of a compiler front end
2.2 Overview of Rats!' module modification syntax
2.3 MOF diagram of TCS element ConcreteSyntax
2.4 MOF diagram of TCS element Templates
2.5 MOF diagram of TCS element ClassTemplate
2.6 MOF diagram of TCS element Sequence
2.7 MOF diagram of TCS element Expression
2.8 MOF diagram of TCS element SequenceElement
2.9 MOF diagram of TCS element Property
2.10 MOF diagram of TCS element OperatorList
2.11 MOF diagram of TCS element FunctionTemplate
2.12 MOF diagram of TCS element EnumerationTemplate
3.1 Architecture Overview
3.2 TCS to grammar transformation
3.3 Using the legacy TCS parser to create a TCS instance
3.4 MOF diagram of tcs.ConcreteSyntax after adding import
4.1 Handler-based transformation of TCS instances
4.2 Rats! parser optimizations
4.3 MOF class diagram of xtc.parser.InjectorState
4.4 Sample metamodel illustrating operatored expressions
4.5 MOF extract: namespaces and classifiers
4.6 Parse tree for input "PrimitiveTypes::String"
4.7 DFA recognizing keyword followed by identifier


Listings

2.1 Identifier-Keyword conflict
2.2 TCS introductory example (from TCS.tcs)
3.1 Extract of the modified concrete syntax specification for TCS
4.1 Memoization example - TCS mapping snippet
4.2 Memoization example - simplified generated grammar
4.3 Memoization example - simplified parser code
4.4 TCS.tcs: concrete syntax for classifiers and namespaces
4.5 TCS.tcs: conditionals and expressions


1. Introduction

Model-driven engineering (MDE) or model-driven software development (MDSD) regards models as first-class entities. Instead of merely describing or documenting a process, they serve as the central entity driving the process. Models are used in every stage of a development process - not only the design phase - and may range from high-level to very specific. Model-to-model transformations, optionally augmented with handwritten code, are used to specify the relation between models on different levels of abstraction.

Domain-specific languages (DSLs) follow the paradigm that code is a model, too. In a circumscribed and well-understood application domain they are used to describe entities, relationships and behavior. The fundamental difference to general-purpose languages is that they do not strive for universality. This implies that when used in another setting, the languages describing the entities are combined rather than extended, to avoid the evolution of DSLs into large, monolithic languages, which would nullify the advantage of domain-specificity in the first place.

Tools are crucial for high productivity in model-centric development and for a higher acceptance of model-driven processes. This conclusion is justified by the revolutionary increase in productivity caused by the availability of integrated development environments (IDEs) for programming languages. MoNeT, short for modeling needs tools, is a project at SAP AG that aims to provide better tools for modeling. The presented work was devised within this project as a cooperation of Forschungszentrum Informatik (FZI), Karlsruhe, and SAP.

1.1 Textual Modeling

Domain-specific languages are defined by an abstract syntax and one (or many) concrete syntaxes. Similar to abstract syntax trees of general-purpose languages (GPLs), the abstract syntax is a representation of a language artifact on a level that is independent of the artifact's textual or graphical appearance. So a concrete syntax must be given to allow for the creation and editing of domain-specific code.


Graphical editors have been around for more than a decade, mostly promoted by the general-purpose modeling language UML1 and its various diagram types. One major advantage of graphical concrete syntaxes for DSLs is their well-defined interface for operations creating, changing or deleting model elements. This makes it possible to apply local changes to models, which is not trivial in textual editors as it requires incrementality of the editor framework.

However, textual syntaxes bring several advantages. The position paper [GKR+07] points out ten items, summarized in the following:

• Information content: Graphical representations of large, complex models turn out to exceed what can be grasped by a developer.

• Speed of creation: Especially for experienced users, a graphical editor that requires numerous mouse clicks impedes rapid creation and evolution of models.

• Integration of languages: Conditions and actions attached to graphical representations are textual already, but often badly integrated. Complete textual representations are considered more productive.

• Speed and quality of formatting: Formatting algorithms for graphical models are doubted to be as effective as textual formatters, because good graphical layout cannot be guaranteed automatically without taking the model semantics into account.

• Platform and tool independence: Text can be edited with any editor. Refraining from convenient additional functionality (syntax highlighting, code completion etc.), this can even be done without a special tool.

• Version control: Text can be shared in repositories easily, since this task is well understood and supported by all versioning systems. Methods to compare, replace or merge text are much simpler than comparing graphical representations.

• Editors (almost) for free: Features like syntax highlighting and code completion need to be put on top of existing textual editing environments.

• Outlines and graphical overviews: Textual syntaxes can be used to create outlines in a generic way, allowing a graphical representation to serve as a view on the textual syntax.

• Parsers, prettyprinters, code generators and translators are rather easily developed: These components can be derived generically from the concrete syntax. (Note: this advantage is specific to the MontiCore environment, in which the abstract syntax definition is derived from the grammar definition.)

• Composition of modeling languages: (Note: shared symbol tables and attribute grammars are claimed to promote language composition. We doubt that this approach prevents the general problem of keyword pollution, as detailed in Sec. 2.1.1.)

1 Unified Modeling Language, specified by the Object Management Group. See http://www.omg.org/spec/UML/2.0/


1.2 Setting

[Figure 1.1 shows a component diagram: the CTS framework reads the mapping definition, which references a grammar and a metamodel, and generates the CTS tools (lexer, parser, semantic analyzer, emitter). The generated tools and the editor read and manipulate the text artefact; the parser creates and updates the model, an instance of the metamodel.]

Figure 1.1: Components of a concrete textual syntax framework (from [GBU08])

Fig. 1.1 depicts the general design of a concrete textual syntax framework, which is basically common to all frameworks that support textual editing of domain-specific languages.

Grammar

Grammars are used to specify what well-formed textual artifacts look like. In all practically relevant cases, these grammars are context-free in the language-theoretic sense and can thus be specified with the Extended Backus-Naur Form (EBNF). So code (as a textual artifact) must obey the syntactic rules declared in the grammar. Some language-theoretic consequences are discussed in Sec. 2.1.2.
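For illustration, a small EBNF fragment for a toy expression language (purely illustrative, not one of the grammars discussed later):

```ebnf
expr   = term , { ( "+" | "-" ) , term } ;
term   = factor , { ( "*" | "/" ) , factor } ;
factor = number | identifier | "(" , expr , ")" ;
```

Every rule rewrites a single nonterminal to terminals and nonterminals without reference to the surrounding context - the defining property of a context-free grammar.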

There is a second notion of grammars in this context: a grammar serves as input for a parser generator. Language-independence of the CTS framework can only be achieved by generating the domain parser automatically from the abstract and concrete syntax definition. Details of the parser-specific grammar format employed here can be found in Sec. 3.3.

Metamodel (=Abstract Syntax Model)

The abstract syntax is defined as a metamodel, i.e. described with elements from a meta-metamodel (such as MOF, Ecore, or KM3). Its elements are similar to the possible nodes of an abstract syntax tree (AST) known from compiler construction. Unless they carry semantic information, specific details of the expected concrete syntax should not be included in the metamodel. The resulting gap between what is textually present and how it should be represented in the abstract syntax must be covered by the mapping definition.
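As a toy illustration of this separation, an abstract syntax can be defined as plain classes that say nothing about notation (illustrative Python classes, not one of the thesis metamodels):

```python
from dataclasses import dataclass, field

# A miniature abstract syntax defined as classes, independent of any
# concrete notation; the classes are invented for illustration.
@dataclass
class Variable:
    name: str
    type: str

@dataclass
class Package:
    name: str
    variables: list = field(default_factory=list)

# The same abstract instance could be written "var x : String;" in one
# concrete syntax and "String x;" in another; bridging that gap is the
# job of the mapping definition.
pkg = Package("demo", [Variable("x", "String")])
print(pkg.variables[0].type)  # String
```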

Mapping Definition

The mapping defines a bridge between abstract and concrete textual syntax. This bridge may be specified as part of the framework (advancing rapid development of


a new language) or as a separate artifact (enhancing flexibility and expressiveness of the targeted languages).

Bridging concrete and abstract syntax can be tackled from two ends:

• EBNF-like mappings are very close to the expected concrete syntax. They define rules that can be easily transformed into a format acceptable for a parser generator.

• Template-style mappings define the expected concrete syntax for each metamodel element. Prettyprinting an instance of the abstract model is easier with this approach. In [JBK06], details of a representative of this approach are presented.

A modified version of the syntax definition language from [JBK06] is employed in the CTS framework discussed here. We therefore highlight its syntax and semantics in Sec. 2.2.
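To make the contrast concrete, the template-style direction can be caricatured as a mapping from metamodel classes to concrete-syntax templates, which turns prettyprinting into a simple lookup (a generic Python sketch with invented class and template names, not TCS syntax):

```python
# Each metamodel class maps to a concrete-syntax template (invented names).
TEMPLATES = {
    "Package":  "package {name} {{ {body} }}",
    "Variable": "var {name} : {type};",
}

def prettyprint(element):
    """Render a model element by instantiating its class's template."""
    cls = element["class"]
    fields = {k: v for k, v in element.items() if k != "class"}
    return TEMPLATES[cls].format(**fields)

print(prettyprint({"class": "Variable", "name": "x", "type": "String"}))
# var x : String;
```

Parsing, in contrast, requires deriving grammar rules from these templates - the direction the thesis's TCS-to-grammar transformation takes.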

Both approaches come with drawbacks. The higher abstraction incorporated in the syntax metamodel creates a gap to the concrete textual syntax defined by a grammar. Depending on the choice of one or the other approach, the syntax editing framework must close this gap without outside information in order to be both flexible and expressive.

Lexer/Parser

One of the most central tools generated by the CTS framework is the parser for textual artifacts (code). It is triggered by the textual editor upon changes and updates the syntax model accordingly. The editor is responsible for all user interaction and must hide the model-based nature of the syntax from the programmer. As all generated tools are language-dependent, they need to be re-generated when the metamodel or the syntax mapping is changed.

1.3 The Language Composition Vision

Flexibly composing and embedding languages has been longed for for quite some time. Recently, this topic has earned new attention in the context of domain-specific languages. DSLs are mostly used as silo development units, independently of other domains. Composability of DSLs is a big challenge and is currently investigated by several research teams.

In the domain of modeling, a prime representative of language composition is embedding constructs from the Object Constraint Language (OCL2) into some domain-specific language D. The problem of composition is three-fold:

• Integrating the type system and structure of the two languages (affecting the abstract syntax)

• Combining the concrete syntax (affecting the domain-lexer and -parser)

2 See http://www.omg.org/spec/OCL/2.0/


• Reconciling the two textual editors (with respect to syntax highlighting, codecompletion, ...)

The domain-specific paradigm implies that all artifacts and tools for both languages are present, i.e. we have a lexer, a parser, an editor, an abstract syntax model and a mapping for both languages. Composing the two now implies that, first, the abstract syntax for D has to somehow reference the OCL-statement element. Second, the concrete syntax needs to know where the textual representation of the OCL query is expected.

If we consider analyzing a textual artifact conforming to D with the OCL embedding, we need to answer a number of fundamental questions:

Question 1: How do we handle lexical conflicts?

A lexical conflict arises during the lexical analysis phase, before parsing an input. Let's say we have a keyword from OCL that matches an identifier name in the code (such as context). We cannot tell whether this is acceptable and may be handled later in the parsing phase, or whether it must be prohibited. This is due to the fact that lexers are finite automata that do not take the context of a construct into account. The latter solution (prohibiting) would imply reserving all keywords from all imported languages (plus their closure). The outcome of a parse would then depend on which import statements were made, even if no constructs from the imported language are used.

An alternative way to tackle this is to tell the lexical analyzer about the imported keywords. The automaton can then assign an indefinite type instead of either keyword or identifier. This has an effect on the parsing phase, as indefinite tokens must be accepted wherever either a keyword or an identifier is expected.
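The indefinite-type idea can be sketched in a few lines (a toy Python classifier; the keyword sets are invented for illustration and are not taken from the thesis):

```python
# Keywords of the host language D and of an imported language (illustrative).
HOST_KEYWORDS = {"if", "boolean", "new"}
IMPORTED_KEYWORDS = {"select", "from", "where"}

def classify(lexeme):
    """Assign a token type to an identifier-like lexeme.

    Lexemes matching an imported keyword get the indefinite type
    KEYWORD_OR_IDENT; the final decision is deferred to the parser."""
    if lexeme in HOST_KEYWORDS:
        return "KEYWORD"
    if lexeme in IMPORTED_KEYWORDS:
        return "KEYWORD_OR_IDENT"  # indefinite: parser decides from context
    return "IDENT"

print(classify("select"))  # KEYWORD_OR_IDENT
```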

Question 2: Can we keep the toolkits separate with slight modifications, or do we have to create a compound parser/lexer pair?

Although re-generating all tools with the modified concrete and abstract syntax definitions seems tempting, it violates a fundamental paradigm in software engineering: don't repeat yourself (DRY). For each combination of two languages we get a compound toolkit, which is clearly undesirable.

So let's assume we can access the appropriate methods of the other parser to start analyzing a construct accordingly. Then the next question arises, concerning scopes.

Question 3: How can we make the two analyzers aware of each other's variables?

Here we have the first issue that points to the fact that composing is more than embedding. Say we declared and initialized a reference to a model element outside the OCL query. This reference should be available inside the query to allow for realistic and complex queries. Since we claim that the parser toolkits need to be kept separate, an interface-like mechanism is needed that consolidates the constructs in question.

The work at hand does not claim to answer all questions involving the composition of domain-specific languages. This is why this section is entitled language composition vision. Instead, the specific subquestions arising from lexical conflicts are focused on.


1.4 Goal

We aim to design and implement or modify a transformation framework that targets language composites. By using a parser-generator framework that works without a prior lexical analysis phase, we tackle the issue of identifiers conflicting with keywords from an imported language. This involves the following sub-items:

• Review available parser generator technologies with respect to expressiveness and support of scannerless parsing.

• Select a scannerless parser generator meeting the requirements given by the concrete textual editing framework.

• Highlight the mapping definition's specifics and point out critical points of the transformation to a grammar.

• Compare the grammar formats (existing and new parser generator) and devise a mechanism to auto-generate a grammar from a metamodel and the specified syntax mapping.

• Implement and test the transformation and validate the output.

• Bootstrap the DSL describing syntax mappings.

• Implement a mechanism substituting the lexical analysis phase, i.e. emulating a tokenizer.

• Discuss requirements for an editor integration. Point out consequences for incremental parsing and error reporting.

• Estimate the feasibility of a complete migration to scannerless parsing in the CTS framework.

1.5 Outline of the Thesis

The work is structured as follows. Ch. 2 investigates problems with the existing textual editing framework and states the detailed problem and the techniques and languages involved. Ch. 3 presents, on an implementation-independent level, the transformation of a textual syntax specification to the selected grammar format. Ch. 4 gives details about the implementation of the transformation. Crucial aspects arising from the migration to a scannerless parsing technique are discussed, and an algorithm substituting the lexer phase with token creation at parse time is presented. Ch. 5 gives a brief summary and an outlook on the feasibility of scannerless parsing in the context of textual modeling.


2. Analysis

Modularizing domain-specific languages must be tackled from both the abstract and the concrete syntax definition. Support for modularization on the abstract syntax level is fairly straightforward. Metamodeling the abstract syntax allows the use of namespaces and key attributes (Ecore) or unique IDs (MOF) to identify referenced elements from more than one language specification, even if there are name clashes.

The same is not true for specifications of concrete syntax, i.e. definitions of the mapping between abstract and concrete syntax. One key concept of textual concrete syntaxes is to hide things like namespaces and model element resolution from the programmer. However, this implies that the editor framework has to take care of the disambiguation of name clashes and related issues.

This chapter explores requirements and questions that arise from composing languages on the level of textual concrete syntax.

2.1 Problem Statement

The existing technology supporting textual editors (FURCAS) suffers from the problem that it only works on monolithic language definitions: a strict one-to-one relation between the abstract and concrete syntax definition on the one hand and a parser for the textual artifacts on the other hand. Reuse of already specified languages and of generated components is only possible by copying and pasting the concrete syntax definition into the new language and re-generating, leading to undesirable effects including duplication of code, redundancy, and large and complex specifications. Double maintenance is a known issue caused by this.

Supporting modularity on the textual syntax level requires small changes to the DSL that specifies the mapping between concrete and abstract syntax (TCS), but has more impact on the generation of a parser for the combined language, because the existing parsers cannot be glued together easily.

2.1.1 Keyword Pollution and Scannerless Parsing

Traditional compiler front ends can be modeled as depicted in Fig. 2.1. The first


[Figure 2.1 shows the pipeline: the source program enters the lexical analyzer, which passes tokens to the parser; the parser passes a syntax tree to the intermediate code generator, which emits three-address code. All components access a shared symbol table.]

Figure 2.1: A model of a compiler front end (from [ASU86])

step, which is called the lexical analysis, scanning or tokenizing phase, is responsible for separating the input character stream into meaningful units, so-called lexemes. Furthermore, these lexemes are assigned abstract token types such as identifier, relational operator or keyword, as defined by the grammar specification.

Specifically, lexical rules for identifiers are usually defined by regular expressions (for example: a character followed by an arbitrary sequence of characters and digits). When the lexical rules for identifiers are applied to keywords, they meet the criterion as well. That is why keywords declared in the language are stored in a table. Before deciding whether a lexeme is a keyword or an identifier, this table is looked up. This can be implemented easily and is totally sufficient for non-composed languages, as long as the set of reserved words is stable.
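The keyword-table lookup can be sketched as follows (a toy Python tokenizer; the keyword set is illustrative and not the thesis implementation):

```python
import re

# Reserved words of an illustrative language, stored in a keyword table.
KEYWORDS = {"if", "while", "boolean", "new"}

def tokenize(source):
    """Split the input into lexemes and assign token types.

    Anything matching the identifier rule is checked against the
    keyword table before it is classified as an identifier."""
    tokens = []
    for m in re.finditer(r"[A-Za-z_][A-Za-z0-9_]*|\d+|\S", source):
        lexeme = m.group()
        if re.fullmatch(r"[A-Za-z_][A-Za-z0-9_]*", lexeme):
            kind = "KEYWORD" if lexeme in KEYWORDS else "IDENT"
        elif lexeme.isdigit():
            kind = "NUMBER"
        else:
            kind = "SYMBOL"
        tokens.append((kind, lexeme))
    return tokens

print(tokenize("if (x) y = 1"))
# [('KEYWORD', 'if'), ('SYMBOL', '('), ('IDENT', 'x'), ('SYMBOL', ')'),
#  ('IDENT', 'y'), ('SYMBOL', '='), ('NUMBER', '1')]
```

Note that the classification depends only on the lexeme itself and the fixed keyword table - never on the syntactic context.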

When putting two language definitions together, the problem arises that an identifier might match the keyword rule from the imported or embedding language. Consider the well-known SQL language standard, in which over 200 words are reserved1. Chances are high that the embedding language comes into conflict with one of these keywords. The conflict can be inherent (identical keywords in both languages with different semantics) or can depend on the source input (identifiers used in the source code that are identical to some keyword). This phenomenon is referred to as keyword pollution.

Instead of determining the token type of a lexeme in a stage prior to syntactic analysis (parsing), information from the syntactic analysis can be used to tell whether a lexeme is an identifier or a keyword. Consider the compound statement shown in Listing 2.1. Here, we have a conflict between the boolean identifier select and the reserved SQL word. It is obvious that the context where the lexeme select occurs helps to distinguish the two cases. Assume that the only location where an SQL construct is expected is in the argument list of the constructor of SQLStatement. Then there is no doubt that the first and second occurrences (lines 1 and 2) must be identifiers and the third occurrence (line 3) must be a keyword.

1 boolean select = getStatus();
2 if ( select ) {
3     Statement s = new SQLStatement(select * from table 1;)
4 }

Listing 2.1: Identifier-Keyword conflict

1 Depending on the SQL version; it is 295 for SQL:1999, as specified by ISO/IEC 9075. See http://www.iso.org


Lexing without syntactic information cannot discriminate between the two! This leads to the conclusion that the assignment of token types to lexemes must be postponed to the syntactic analysis. Integrating lexing with parsing is usually called scannerless or token-free parsing. The term was coined by Salomon and Cormack in [SC89]. Based on this work, scannerless parsing has received considerable attention over the past years ([Vis97], [For02], [Gri06], [KKV08]). The work at hand seeks to take advantage of the scannerless technique, too.
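A scannerless parser defers this decision to the grammar rules themselves: whether the characters of select form a keyword or an identifier depends on which production is currently being tried. A minimal Python sketch of this idea (the rules and inputs are illustrative, not the thesis grammar):

```python
import re

def parse_word(text, pos):
    """Character-level recognition of an identifier-shaped word."""
    m = re.match(r"[A-Za-z_][A-Za-z0-9_]*", text[pos:])
    return (m.group(), pos + m.end()) if m else (None, pos)

def parse_identifier(text, pos):
    """In identifier position, any word is accepted -- even 'select'."""
    return parse_word(text, pos)

def parse_sql_keyword(text, pos, keyword):
    """In SQL position, the same characters are matched as a keyword."""
    word, new_pos = parse_word(text, pos)
    return (word, new_pos) if word == keyword else (None, pos)

# 'select' in identifier position: accepted as an identifier
print(parse_identifier("select", 0))             # ('select', 6)
# 'select' where an SQL statement is expected: matched as the keyword
print(parse_sql_keyword("select", 0, "select"))  # ('select', 6)
# 'foo' in SQL position: rejected, the keyword rule fails
print(parse_sql_keyword("foo", 0, "select"))     # (None, 0)
```

The same character sequence is thus classified differently depending on the production being tried - exactly what a context-free scanner cannot do.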

Summing up, the first requirement that a parser generator used for language composition must meet is scannerlessness, in contrast to the ANTLR generator employed up to now.

While the absence of a separate lexing phase is a rather technical question that should not affect the number of languages covered by the framework, the accepted grammar classes are affected by the parsing technology. These are highlighted in the following.

2.1.2 Parsing Techniques and Grammar Classes

Almost all constructs from programming languages can be described by context-free grammars (CFGs). The context-free language class is thus theoretically best-suited. However, due to practical considerations, proper subsets such as LL(k) or LR(k) have been much more relevant. Theoretical results [RS83] show that parsing an arbitrary context-free sentence with well-known algorithms such as the Cocke-Younger-Kasami algorithm or Earley's algorithm takes O(n²) space and O(n³) time for an input string of size n. Non-tabular, backtracking algorithms may even take exponential time and linear space. These complexities are unacceptable for practical settings and call for more efficient solutions.

The two equally relevant grammar classes are those accepted by top-down parsers (LL) and bottom-up parsers (LR) and their variants, both with fixed lookahead k. They are not discussed here in detail; the reader is referred to [AU72].

Generalized LR Parsing

Despite the unpleasant result that the full class of context-free languages is hard to parse, CFGs are particularly interesting for the composition of languages. In contrast to other grammar classes, CFGs are closed under composition [Bra08]. That is why much effort has been put into generalized LR parsers that fork on non-determinism in the parse table and construct parse forests instead of parse trees, leading to acceptance of all CFGs ([Vis97], based on [Tom87]).

Parsing Expression Grammars

A completely different approach to tackling the problem of ambiguities in context-free language constructs was more recently proposed by Ford [For04]. The PEG formalism is similar to the Extended Backus-Naur Form, but it adds prioritized choice to avoid ambiguities. Parsing expressions and PEGs are defined as follows (from [For04]):

Definition 2.1 (Parsing Expression Grammar). A parsing expression grammar is a tuple G = (VN, VT, P, S) with nonterminals VN, terminals VT where VT ∩ VN = ∅, productions P of the form A → e with a nonterminal A ∈ VN and a parsing expression e, and a designated start symbol S ∈ VN.


Definition 2.2 (Parsing Expression). The empty string ε, every terminal a ∈ VT, and every nonterminal A ∈ VN is a parsing expression. Let e, e1, e2 be parsing expressions. Then the sequence e1 e2, the prioritized choice e1 / e2, the Kleene closure e*, and the not-predicate !e are parsing expressions, too.

There is a strong connection between PEGs and backtracking recursive-descent parsers. It is very straightforward to write such parsers for PEGs. With prioritized choices there is no need to construct parse forests, because the alternatives can be tried in order until a matching one is found. As pointed out before, the backtracking nature can exhibit exponential time complexity. This is countered by memoization, which guarantees linear parsing time.
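To make the correspondence concrete, the following is a minimal sketch (illustrative only, not code from any of the cited tools) of a backtracking recursive-descent recognizer for the toy PEG S ← "ab" / "a". It shows how prioritized choice is realized: alternatives are tried in the stated order, and the input position is reset before each further attempt.

```java
// Minimal backtracking recognizer for the PEG  S <- "ab" / "a".
// Prioritized choice: the first alternative that matches wins; on
// failure the position is restored before trying the next one.
public class PegChoice {
    private final String input;
    private int pos;

    PegChoice(String input) { this.input = input; }

    // Try to consume a literal at the current position.
    private boolean literal(String s) {
        if (input.startsWith(s, pos)) { pos += s.length(); return true; }
        return false;
    }

    // S <- "ab" / "a"
    boolean parseS() {
        int mark = pos;            // remember position for backtracking
        if (literal("ab")) return true;
        pos = mark;                // backtrack before the next alternative
        return literal("a");
    }

    // Accept only if S matches the complete input.
    static boolean matches(String in) {
        PegChoice p = new PegChoice(in);
        return p.parseS() && p.pos == in.length();
    }

    public static void main(String[] args) {
        System.out.println(matches("ab")); // true
        System.out.println(matches("a"));  // true
        System.out.println(matches("b"));  // false
    }
}
```

Note that the prioritized choice makes the grammar unambiguous by construction: for the input "ab", the second alternative is never even considered.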

Packrat Parsing

The parsing technique described in [For02] avoids exponential parsing time due to backtracking by saving intermediate parsing results as they are computed, so that no part of the input is parsed more than once, trading memory consumption for performance. Additionally, the packrat parsing technique exhibits some interesting properties that are useful for the discussed area of application:

• Unlimited Lookahead: Packrat parsers have no lookahead restriction. Unlike LL(k) or LR(k) parsers, which take into account the following k tokens for transitions or reductions, packrat parsers have no fixed lookahead. The backtracking nature (in contrast to prediction) allows them to recognize a broader class of languages.

• Scannerlessness: In predicting parsers, tokens are needed in order to predict the next action. Looking at the next few characters only is generally not sufficient to decide what to do. That is why predicting parsers always rely on a separate lexical analysis phase for token creation. Packrat parsers can use their unlimited lookahead to scan tokens of arbitrary length, enabling the integration of lexical and syntactic analysis. Implementations of packrat parsers are thus usually scannerless by design.

• Composability: Predicting parsers are not suited for composition. Consider an evolving grammar to which an alternative is added that contains a new, arbitrary nonterminal in the middle. Whenever there is another alternative with the same prefix, a predicting parser with limited lookahead will fail to predict the correct alternative, because the nonterminal might be nested and thus of arbitrary length. In contrast, packrat parsers are able to look beyond the nonterminal and eventually decide whether the alternative fits. Hence, composition of language constructs can be facilitated with a packrat parser.
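The memoization underlying packrat parsing can be sketched in a few lines (a toy example, not Rats! code): each nonterminal caches its parse result per input position, so backtracking into an already-visited position does no repeated work.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of packrat memoization for the toy grammar  A <- "a" A / "a".
// Each (rule, position) pair is parsed at most once; later attempts
// reuse the cached result instead of re-parsing.
public class Packrat {
    private final String input;
    // memoA maps a start position to the end position of a successful
    // match of A, or to -1 if A fails at that position.
    private final Map<Integer, Integer> memoA = new HashMap<>();

    Packrat(String input) { this.input = input; }

    // Returns the end position after matching A at pos, or -1 on failure.
    int parseA(int pos) {
        Integer cached = memoA.get(pos);
        if (cached != null) return cached;          // reuse earlier result
        int result = -1;
        if (pos < input.length() && input.charAt(pos) == 'a') {
            int rec = parseA(pos + 1);              // first alternative: "a" A
            result = (rec != -1) ? rec : pos + 1;   // second alternative: "a"
        }
        memoA.put(pos, result);                     // record for future attempts
        return result;
    }

    public static void main(String[] args) {
        Packrat p = new Packrat("aaaa");
        System.out.println(p.parseA(0)); // 4: A matches the whole input
    }
}
```

The table trades memory for time: every rule is linear in the input length regardless of how often backtracking revisits a position.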

The current implementation of the framework uses ANTLR to produce domain parsers. ANTLR is a recursive-descent, predicated LL(*) parser generator. This


# Syntax
1 Type Nonterminal += <Name1> e / <Name2> ... ;
2 Type Nonterminal += <Name1> ... / <Name2> e ;
3 Type Nonterminal -= <Name> ;
4 Type Nonterminal := ... / <Name> e ;
5 Type Nonterminal := e ;
6 Attributes Type Nonterminal := ... ;

Figure 2.2: Overview of Rats!' module modification syntax. The different modifications (1) add a new alternative before an existing one, (2) add a new alternative after an existing one, (3) remove an alternative, (4) override an alternative with a new expression, (5) override a production with a new expression, and (6) override a production's attributes, respectively. (from [Gri06])

implies that it works top-down and accepts a proper subset of the class of context-free languages. Although it features backtracking and memoization, too, the deterministic finite automaton (DFA) employed for arbitrary lookahead is not as powerful as a (deterministic) pushdown automaton. Furthermore, it relies on tokens as lexical units, which prohibits composability.

Rats! is considered the most appropriate parser generator [Gri06]. It produces packrat parsers, is freely available with sources, and is written in Java. Apart from that, its most prominent feature is its support for modular syntax definitions. The possible rule modifications are listed in Fig. 2.2.

In addition, modules can be parameterized. This is particularly helpful when imported syntax elements change: the host language then automatically receives the changes from the module passed as argument.

2.2 TCS Mapping Language

A language for specifying mappings between abstract and concrete textual syntax was proposed by Jouault et al. [JBK06]. The main idea is to provide a simple yet concise template-based language for bidirectional specifications between abstract and concrete syntax. Bidirectional here means that a TCS artifact can be read in two directions:

• From model to text: Given a model instance conforming to an abstract syntax model and the TCS mapping, determine a textual representation capturing all model elements, attributes, and associations as defined by the mapping. This direction is sometimes referred to as prettyprinting the model.

• From text to model: In order to edit text from which a model can be created unambiguously, the mapping must specify the other direction, too. Parsing in combination with model injection is the process transforming text into a model. Therefore, TCS needs constructs that guide a parser when recognizing textual constructs representing model elements.


For illustration, consider an arithmetic expression like 3 + 4 * 9. An abstract syntax tree for it has an addition element at the root with an integer literal (3) on the left and a multiplicative element on the right containing two leaves (4 and 9). Emitting text for this abstract representation is straightforward. Without knowing arithmetic rules, an emitter may roll out the abstract syntax and emit text for each element, from left to right. One might even add parentheses around the multiplicative expression to emphasize the precedence (known from the tree structure of the AST).

Coming from the textual concrete syntax, the TCS mapping needs to declare, first, the precedence of operators and, second, how compound expressions are represented as models. This is why the mapping contains a list of operators and templates for these operatored constructs. With this information it must be possible to derive a parser for the domain of arithmetic expressions from the abstract and concrete syntax definition.
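A minimal prettyprinting emitter for the AST described above could look as follows (an illustrative sketch; the class names are not taken from the framework): the emitter simply walks the tree left to right, and the precedence is implicit in the tree structure rather than in the emitted text.

```java
// Illustrative prettyprinter for the AST of 3 + 4 * 9: an addition at
// the root, a multiplicative element as its right child.
public class Emit {
    interface Exp { String print(); }

    // Leaf node: an integer literal.
    record Lit(int value) implements Exp {
        public String print() { return Integer.toString(value); }
    }

    // Inner node: a binary expression, emitted left to right.
    record Bin(String op, Exp left, Exp right) implements Exp {
        public String print() { return left.print() + " " + op + " " + right.print(); }
    }

    public static void main(String[] args) {
        Exp ast = new Bin("+", new Lit(3), new Bin("*", new Lit(4), new Lit(9)));
        System.out.println(ast.print()); // 3 + 4 * 9
    }
}
```

Reading the text back into such a tree is the hard direction: it needs the operator precedences, which is exactly what the TCS operator lists provide.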

The rest of this section deals with the different TCS constructs (always depicted as MOF diagrams of their meta elements). The focus is on the language constructs that are most relevant for the creation of parsers automatically generated from a TCS syntax specification and the respective abstract language syntax.

2.2.1 An Introductory Example

To illustrate the fundamental idea of concrete-to-abstract syntax mappings, the following excerpt from TCS.tcs is given:

 1 syntax TCS {
 2
 3   primitiveTemplate stringSymbol for PrimitiveTypes::String using STRING:
 4     value = "unescapeString(%token%)";
 5
 6   template TCS::ConcreteSyntax main context
 7     : "syntax" name (isDefined(k) ? "(" "k" "=" k ")") "{" [
 8       templates
 9       (isDefined(keywords) ? "keywords" "{" [ keywords ] "}")
10       (isDefined(symbols) ? "symbols" "{" [ symbols ] "}")
11       operatorLists
12       tokens
13       (isDefined(lexer) ? "lexer" "=" lexer{as = stringSymbol} ";")
14     ] {nbNL = 2} "}"
15   ;
16
17   ...
18
19 }

Listing 2.2: TCS introductory example (from TCS.tcs)

Without going into detail, a few things should be noticed about Listing 2.2:


• Mappings define concrete syntax per meta class (two in the example: primitivetypes.String and tcs.ConcreteSyntax) in a template manner.

• Syntax provided for a meta class can depend on its attributes and references or not (class template vs. primitive template).

• Within a class template, a meta class's attributes can be referenced regardless of their multiplicity. Following the template style, concrete textual syntax for the respective elements is expected (or emitted) where the attribute is referenced (name, k, templates, keywords, etc.).

• Additional formatting information may be given (square brackets and option nbNL=2) to specify the exact output.

2.2.2 tcs.ConcreteSyntax

[Class diagram: ConcreteSyntax (lexer : String, k : Integer) holds, each as a 1-to-0..n composition, +templates (Template: disambiguate, disambiguateV3 : String), +keyword (Keyword), +symbol (Symbol), +token (Token: pattern : OrPattern, isOmitted : Boolean), and +operatorlist (OperatorList: name : String).]

Figure 2.3: MOF diagram of TCS element ConcreteSyntax

The main meta element of a TCS mapping is tcs.ConcreteSyntax, holding, as a composition, templates, keyword and symbol definitions, token specifications, and operator lists. The attributes lexer and k are specific to the ANTLR parser generator: they can hold the maximum lookahead (k) and a string of lexer code that can override the default lexer provided by ANTLR. Hence, they can be ignored in the context of the work at hand.

2.2.3 tcs.Template

The abstract type Template is the backbone of a syntax definition. For meta classes from the abstract syntax (metamodel), concrete textual syntax can be specified, usually one template per meta element.² As a QualifiedNamedElement, every template references a Classifier from the M3 metamodel, i.e. an element of the abstract syntax. This is the element whose concrete textual syntax is specified by the respective template.

The various subtypes of tcs.Template each serve a special purpose:

² Template modes are intentionally omitted for simplicity.


[Class diagram: abstract Template (disambiguate, disambiguateV3 : String) with subtypes ClassTemplate (isAbstract, isDeep, isOperatored, isMain, isMulti, isContext, isAddToContext, isNonPrimary, isReferenceOnly : Boolean; mode : String), PrimitiveTemplate (templateName, tokenName, value, serializer : String; orKeyword, isDefault : Boolean), EnumerationTemplate (automatic : Boolean), FunctionTemplate (functionName : String), and OperatorTemplate (isContext, isReferenceOnly : Boolean); a ConcreteSyntax holds 0..n +templates and 0..n +imports.]

Figure 2.4: MOF diagram of TCS element Templates

• Primitive templates define syntax for simple, lexical constructs.

• Class template is the most important type. It is used to specify concrete syntaxfor complex classifiers.

• Operator templates are defined for model elements representing compound expressions connected by one or more designated operators. Typically they are used with operatored expressions.

• Enumeration templates refer to classifiers of an enumeration type and mayspecify the appearance of its literals.

• Function templates serve to factor out syntax that appears more than once.

2.2.4 tcs.ClassTemplate

The central element to specify concrete syntax for models is ClassTemplate. It offers a vast number of features; most of the changes in the SAP-specific version target the class template.

Class templates define the textual representation of a classifier. The right-hand side sequence of a class template specifies what is used to produce text from a given instance of the meta class, or how the parser should interpret text to create such an instance. This sequence consists of textual elements independent of the model (string literals, for example) and references to model properties (for details see 2.2.6).


[Class diagram: ClassTemplate (attributes as in Fig. 2.4) with optional (0..1) references +operatorlist to an OperatorList (name : String) as well as +templatesequence and +prefixsequence to a Sequence (isReferenceOnly : Boolean, mode : String).]

Figure 2.5: MOF diagram of TCS element ClassTemplate

For the generation of a parser from TCS definitions, three attributes are particularly interesting:³

• main: When a template is defined as main, the parser will start parsing a given syntax with this rule. In the introductory example (2.2), the class template for tcs.ConcreteSyntax is tagged as main, implying that textual artifacts conforming to the specified language start with syntax and end with a closing curly brace.

• abstract: The prevalent use cases for abstract templates are, first, with model elements inheriting from others and, second, with operatored templates. It is possible to state for an abstract model element that its syntax is specified through its subtypes. With the operatored option, the user can specify that an abstract model element consists of various subtypes connected by operators (of different priorities). In the context of parser generation for domain-specific languages this is the toughest case and needs special consideration.

• referenceOnly: When a complex model element is referenced but should never be created, a referenceOnly template can be provided for it.

The right-hand side of a class template is a Sequence consisting of 0..* SequenceElements, the abstract supertype of everything representing a contribution to the syntax or specifying detailed semantics for the creation of models (fig. 2.6 and 2.8).

References to a model's attributes can be established simply by stating the element's name within a sequence. These properties can be complemented by various options (subtypes of PropertyArg in fig. 2.9). Following the template paradigm, the right template for the referenced model element is looked up and the corresponding syntax specified by that template is inserted.

In the opposite direction (parsing), the attributes are set according to what is textually present.

³ The template attributes multi, context, addToContext, and deep are not discussed here. Discussion of nonPrimary is postponed to sec. 2.2.7.


2.2.5 tcs.Sequence and tcs.SequenceElement

[Class diagram: a Sequence (isReferenceOnly : Boolean, mode : String) contains 0..n SequenceElements; an Alternative (isMulti : Boolean) holds 0..n sequences of type SequenceInAlternative (disambiguate : String); a Block holds exactly one Sequence; a ConditionalElement has a mandatory +thenSequence and an optional +elseSequence.]

Figure 2.6: MOF diagram of TCS element Sequence

As right-hand sides of class templates, Sequences represent the mapping from one meta class to text and vice versa. Their elements (of abstract type SequenceElement) can be of structuring kind, such as Block, Function, or CustomSeparator; of model-related kind, such as Property and InjectorActionsBlock; related to choice, such as Alternative or ConditionalElement; or purely syntactical, such as LiteralRef. See fig. 2.8 for a MOF diagram of all elements inheriting from SequenceElement.

[Class diagram: a ConditionalElement holds one Expression; an AndExp consists of 0..n AtomExps, each with a PropertyReference (name : String); AtomExp subtypes are BooleanPropertyExp, IsDefinedExp, EqualsExp, OneExp, and InstanceOfExp.]

Figure 2.7: MOF diagram of TCS element Expression

ConditionalElement

Depending on a condition of type Expression (depicted in fig. 2.7), different sequences (a then-sequence and an optional else-sequence) can be stated in a ConditionalElement of the following form:


condition ? thenSequence : elseSequence

This is especially useful for sequences that are displayed depending on whether a reference is set, using isDefined.

[Class diagram: SequenceElement subtypes are Alternative (isMulti : Boolean), CustomSeparator (name : String), FunctionCall, ConditionalElement, Block, InjectorActionsBlock (holding 0..n InjectorActions), Property (with a PropertyReference (name : String) and an optional PropertyInit (value : String), specialized as PrimitivePropertyInit and LookupPropertyInit), and LiteralRef (referring to a Literal (value : String), specialized as Keyword and Symbol).]

Figure 2.8: MOF diagram of TCS element SequenceElement

Alternative

Although such variability is usually incorporated in the abstract syntax via inheritance, TCS allows alternative textual syntax within a sequence for multivalued references. For this purpose, alternatives reference 0..* sequences of the specialized type SequenceInAlternative, allowing nested sequences.

2.2.6 tcs.Property

Model attributes of primitive type, such as string or integer, do not require special consideration when printing or parsing. The more critical structural features are references to other model elements. The following questions arise when referencing model elements in textual syntax:

• How can the referenced model element be identified?

• How (and where) should model elements be created if the reference cannot beresolved?


[Class diagram: a Property holds 0..n PropertyArgs and one PropertyReference (name : String) tied to a TypedElement (from model). PropertyArg subtypes: ForcedUpperPArg (value : Integer), ForcedLowerPArg (value : Integer), AsPArg (value : String), LookInPArg (propertyName : String), ModePArg (mode : String), RefersToPArg (propertyName : String), QueryPArg (query : String), CreateInPArg (propertyName : String), FilterPArg (filter, invert : String), ImportContextPArg, CreateAsPArg (name : String), AutoCreatePArg (value : AutoCreateKind), DisambiguatePArg (disambiguation : String), PartialPArg, and SeparatorPArg; the enumeration AutoCreateKind has the literals always, ifmissing, and never.]

Figure 2.9: MOF diagram of TCS element Property

Identification of referenced elements

The property options refersTo, lookIn, and query serve to identify the model element that is being referenced through the syntactic construct. In the easiest case, this can be a uniquely identifying attribute (name, for example). To specify the scope of the lookup more explicitly (expanding or restricting it), property arguments with the lookIn clause can be used. A common usage is to specify #all to leave the current context, or a path expression starting from the current meta element.

The QueryPArg is an extension to the original TCS. It allows the use of OCL or MQL⁴ statements to identify model elements that could not be referenced otherwise (by a path expression or by attribute lookup).

Creation of referenced elements

Usually, referenced elements should be created by the model injector attached to the parser; after all, this is the central idea of textual modeling. In some specific cases, however, this behavior needs to be overridden. Using the property argument autoCreate, the mapping designer can specify whether the referenced model element should be created never, always, or ifMissing (the default).

⁴ MQL: MOIN Query Language, a query language for models syntactically similar to SQL.


2.2.7 tcs.Operators and OperatorTemplate

A central characteristic of abstract language specifications is the absence of a detailed description of how the language constructs are represented. TCS provides a valuable feature to close the gap between abstract language specification (metamodel) and textual concrete syntax by means of Operators and OperatorTemplates. The basic complication arises from the fact that the mapping is not supposed to specify explicit constructs akin to grammar rules. The textual concrete syntax should instead be defined on a higher level by stating operators and priorities and the respective class and operator templates. Thus, the framework generating a parser and model creator from the metamodel and the language mapping must close this gap. Details are discussed in ch. 3 and 4.

OperatorList

The different priority levels associated with an expression are specified in an operator list. In addition, arity and associativity can be declared for each operator.

[Class diagram: an OperatorList (name : String) holds 0..n Priorities (value : Integer, associativity : Associativity, an enumeration with literals left and right); a Priority holds 0..n Operators (isPostfix : Boolean, arity : Integer), each referring to a Literal (value : String); OperatorTemplates reference 0..n +operators.]

Figure 2.10: MOF diagram of TCS element OperatorList

These operators can be used in combination with abstract operatored class templates. When an operator list is stated in the header of an abstract operatored class template, this means that whenever the abstract type is referenced, the concrete subclasses must be processed respecting the given priorities of the operator list. In an arithmetic expressions example, the abstract classifier Expression might be given an abstract operatored template with an operator list consisting of two priorities: the multiplicative operators on level 0 (highest) and the additive ones on level 1. Whenever an arithmetic expression is expected, two things are actually processed:

• the most elementary constituents of arithmetic expressions (number literals, most likely), also called primaries;

• the various combinations of primaries by means of operators, typically binary expressions of multiplicative or additive type. The detailed definition of such compounds is only possible with OperatorTemplates (see below).


OperatorTemplate

Adding an operator list to an abstract template merely specifies in which order the sub-expressions must be parsed. For the mapping of these constructs to the abstract syntax, however, OperatorTemplates are needed. They define how elements from the abstract syntax are put together. Consider the following operator template specification:

operatorTemplate PlusExp(operators = opPlus, source = leftside,
                         storeRightTo = rightside);

This implies that in the abstract syntax, all occurrences of the operator opPlus within an operatored expression are represented as a model element of type PlusExp with the reference leftside set to what is parsed left of the operator opPlus (and rightside respectively).

2.2.8 tcs.FunctionTemplate

As in programming languages, constructs common to more than one model element can be factored out in TCS using functions (expressed by function templates). FunctionTemplates are parameterized with a model element type. Their sequence (right-hand side) can access all properties defined in that element. Thus, functions defined that way can be used ("called") within the sequence of a template specifying the concrete syntax for any of the subclasses of the parameter type.

[Class diagram: a FunctionTemplate (functionName : String) holds exactly one Sequence; a FunctionCall references one +calledFunction.]

Figure 2.11: MOF diagram of TCS element FunctionTemplate

2.2.9 tcs.EnumerationTemplate

For enumeration types, TCS provides a means to specify concrete syntax both easilyand flexibly. In the basic case, EnumerationTemplates rely on the enumeration’sliterals and create syntax automatically.

In a more complex case, 0..* mappings for (some or all) literals of an enumeration can be specified, where the right-hand side is an arbitrary sequence element as introduced in sec. 2.2.5.


[Class diagram: an EnumerationTemplate (automatic : Boolean) holds 0..n EnumLiteralMappings, each associating one EnumLiteralVal (name : String) with one SequenceElement; the elements inherit from LocatedElement (location, commentsBefore, commentsAfter : String).]

Figure 2.12: MOF diagram of TCS element EnumerationTemplate

2.3 Terminology

For the sake of clarity, some of the most important terms are listed below with their synonyms. Although every care was taken to use the terms consistently, this section seeks to prevent misunderstandings.

Syntax Definition

Syntax definition or syntax specification refers to the definition of the concrete textual syntax of a DSL. Although a DSL consists of abstract and concrete syntaxes, the term syntax is usually used for the concrete syntax, while metamodel refers to the abstract syntax.

Mapping

Throughout the work at hand, the term mapping refers to concrete-to-abstract syntax mappings specified in TCS as defined in [JBK06].

Domain Parser

In the context of textual concrete syntax frameworks, the term domain parser refers to the parser generated from a mapping and a metamodel in order to parse code written in the domain-specific language.

Injector

Injector refers to the framework component that manipulates models. This component is driven by the editor and the domain parser.


2.4 Related Work

A number of textual concrete syntax frameworks have been proposed recently. A comprehensive study analyzing the various frameworks can be found in [GBU08]. This publication investigates and classifies the CTS frameworks based on an exhaustive set of framework features.

For the discussion of related work, modularity concepts and support for language composition are most interesting.

• Xtext [BCE+] provides grammar mixins, allowing the import of rules from another grammar. To our knowledge, the parser generator is scannerless, too⁵. However, lexical conflicts are not discussed in detail. Keyword-keyword conflicts are avoided by favoring new keywords over imported ones.

• The Stratego/XT toolkit [BV04] uses a scannerless generalized LR parsing technique and comprehensive disambiguation features developed within the ASF+SDF Meta-Environment [vdBvDH+01]. The general difference is that it focuses on the concrete syntax, while for the work at hand a central premise is that languages are modeled with an abstract syntax defined by a metamodel.

⁵ See a blog entry by the project lead of the Xtext framework: http://blog.efftinge.de/2009/01/xtext-new-parser-backend.html


3. Design

As outlined in Ch. 2, a full-fledged textual editing framework (or concrete textual syntax framework) produces a parser that accepts artifacts written in a domain-specific language and the textual editor that interacts with this parser. Usually, there will also be a lexer for lexical analysis beforehand. The central idea of avoiding the explicit creation of tokens and the association of token types, such as keyword or identifier, with a given token entails the need for a complete rework of the mechanisms that produce the parser on the one hand, and for integration with the existing techniques that update or manipulate the model on the other.

Conceptually, there is a significant amount of parallels between the existing approach using the traditional two-stage parser generator ANTLR and the one using Rats! proposed here. These parallels can be exploited by reusing code from the transformation of a TCS instance into a valid grammar file. However, two fundamental differences impose careful reconsideration of some central parts of the transformation code. First, the way Rats!-generated parsers work, which is backtracking recursive-descent with an ordered set of alternatives, affects grammar code that deals with model injection (updating or creating model instances according to textual input), since no expansion of a nonterminal rule is guaranteed to succeed. Thus, actions that create model elements (or proxies thereof) must potentially be undone when a rule fails.
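To illustrate the consequence for model injection, the following sketch (all names invented, not FURCAS code) shows an ordered-choice rule in the style of a backtracking recursive-descent parser: every model-creating side effect is recorded in an undo log, and when an alternative fails both the input position and the model actions are rolled back before the next alternative is tried.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class BacktrackingSketch {
    private final String input;
    private int pos = 0;
    // each entry undoes one model-creating side effect
    private final Deque<Runnable> undoLog = new ArrayDeque<>();
    private final StringBuilder model = new StringBuilder();

    BacktrackingSketch(String input) { this.input = input; }

    private boolean literal(String s) {
        if (input.startsWith(s, pos)) { pos += s.length(); return true; }
        return false;
    }

    // stands in for a model-injecting action: records its own rollback
    private void createElement(String name) {
        int mark = model.length();
        model.append('[').append(name).append(']');
        undoLog.push(() -> model.setLength(mark));
    }

    // ordered choice: try "ab"; on failure undo all side effects, then try "ac"
    boolean parse() {
        int savedPos = pos, savedUndo = undoLog.size();
        createElement("AB");
        if (literal("ab")) return true;
        pos = savedPos;                                   // reset input position
        while (undoLog.size() > savedUndo) undoLog.pop().run(); // undo actions
        createElement("AC");
        return literal("ac");
    }

    String model() { return model.toString(); }
}
```

After parsing the input "ac", the model contains only the element created by the successful alternative, because the failed first alternative was rolled back.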

The incrementality of the resulting parsers does not allow applying a multi-pass strategy.

Second, the scannerless parsing technique affects all code relying on tokens as the most basic syntactic elements. Error logging and reporting makes extensive use of tokens and is not likely to be easy to realize in a scannerless environment. To our knowledge, this is a known and unsolved problem of scannerless parsing in general.

This chapter outlines the design for the two major parts of the work, which are transforming a TCS instance into a Rats! grammar and integrating the generated parser into the existing textual editor framework.

The discussion of design decisions resulting from more technical barriers, such as Rats!' incapacity to accept parameterized rules or its performance optimizations, is left to the implementation chapter (Ch. 4).


3.1 Overview

Before stating the detailed design decisions that were made for a migration to a scannerless parser generator, the following section gives a brief overview of both technological and conceptual issues that set limits to the solution space.

3.1.1 Technological Overview

Figure 3.1 gives a compact overview of the components of the textual editing framework. Central to the process of developing a new domain-specific language are two documents that are shaded in dark gray:

• The mapping definition specified in TCS is needed in order to define the textual representation of a given model and vice versa. Usually, it is edited with the same framework that is employed to edit language instances ("code").

• A language instance, i.e. an instance of the language specified by the abstract syntax given as syntax model and the concrete textual representation to be derived from the mapping definition.

Figure 3.1: Architecture Overview. For a legend of the FMC diagrams see Fig. 1.1

Given a TCS mapping, the framework needs to be capable of parsing the definition (TCS parser) and of producing all components that are necessary for the analysis of code written in that designated language (DSL parser and editor, shaded in light gray). While the TCS parser must be shipped with the editing framework1, language-specific parts are subject to change and dynamically updated (upon a save action on the mapping, for instance).

All parsers included in the editing framework communicate with an observer responsible for updates and/or creation of model elements via requests stated in an efficient model query language. Changes in a textual editor opened for a language instance are reflected in the underlying abstract code model when the code is re-parsed incrementally. Although editing code written in a DSL works just like editing code in Java, for example, the textual representation is never stored as such. Instead, a decorated domain model (an instance of the abstract syntax model) is stored. Details advocating this approach can be found in [GBU09].

3.1.2 Higher-Order Transformation Approach

Migrating from one parser-generating technology (ANTLR) to another (Rats!) first and foremost concerns the way parsers for DSLs are constructed. Irrespective of the fact that the generated parsers must "fit" into the FURCAS framework, which severely relies on tokens as lexical constructs and Java methods as units of syntactic constructs, a valid parser must be produced from a TCS instance and the respective abstract syntax modeling the language. As depicted in Fig. 3.2, the parser generator processes a grammar file which is produced by the TCS-to-grammar transformation. The grammar can be regarded as a condensed description of the concrete and abstract syntax of a DSL. From a language-theory viewpoint, for a domain-specific language L_D and its grammar G the following must hold: L(G) ⊇ L_D, i.e. every sentence w can be generated by a series of steps from the start symbol S following the grammar: ∀w ∈ L_D : w ∈ {w′ ∈ Σ∗ : S ⇒∗_G w′}.2

Figure 3.2: TCS to grammar transformation

1Note that TCS itself can be treated as a domain-specific language, with abstract and concrete syntax specified just the same way. Using the editing framework with this language specification results in a bootstrapped version of the TCS parser that is part of the framework. This parser/editor pair can be used as a proof of concept, but is discussed later.

2The exact specification of the vocabulary Σ is omitted here, since this would usually be the set of tokens. Obviously, the declaration of tokens is not implicit in a scannerless environment.


The transformation described in the following is of higher order: first, a grammar can be viewed as a way to transform textual input into an abstract syntax. Second, it is the output of a transformation itself. This suggests the classification as a higher-order transformation.

3.1.3 Bootstrapping the TCS

The aforementioned transformation needs a TCS instance (an instance of the TCS abstract syntax represented by its metamodel) as input, which is generally not present. Arguing that the TCS model can be generated using the concrete and abstract language specification leads to a vicious circle: it assumes the existence of a fully-functional CTS framework with which it is generated, since the TCS parser is part of the framework (see 3.1)3.

The good news is that there is a solution that does not beg the question:

• The TCS model could be created using a model editor from outside the CTS framework. Graphical tree-based editors with appropriate property sheets can serve the purpose. However, especially references between model elements can be quite hard to set and need special attention to be correct.

• One could question the entire higher-order transformation approach and start by writing a TCS grammar by hand. Although this solution seems tempting, the problems arising are more subtle and can only be understood knowing details of the targeted environment.

Parts of the framework that are to be reused include the parsing observer that observes the process of parsing any language artifact. It is responsible for delegating actions for model creation or update. These actions need to be in the right position in the parser code to work properly. Adding such action code to a hand-written grammar is not only tedious but also error-prone.

• The special task of the work at hand is to study how an existing CTS framework can be modified to allow language specifications to be composed. This offers the chance to not only reuse significant parts of the framework but also use them in order to create the new one.

Creation of a TCS instance modeling the TCS language is such a point: feeding the TCS syntax mapping and the metamodel into the environment leads to the desired model. Leaving aside technical issues such as serializing the model appropriately, this carries out the task in an elegant way. This solution is favored, and the most important components and artifacts of the bootstrap are depicted in Fig. 3.3 with the desired result shaded in light gray.

3.2 TCS Modifications for Composition

As part of the design towards language compositions, the TCS metamodel and concrete syntax need slight modifications. This highlights the fact that TCS is a

As can be seen later, explicit creation of tokens can, however, be emulated where the requirement makes sense.

3A classic Petitio Principii in which the proposition is assumed to be true as part of the premise.


Figure 3.3: Using the legacy TCS parser to create a TCS instance

domain-specific language, too. Its domain is the specification of concrete syntaxes for models. As for all DSLs, changes to the language must apply to the abstract syntax and all of its concrete syntaxes.

Modification of Abstract Syntax

On the abstract syntax level, composition can be supported by adding a loop "imports" to ConcreteSyntax, allowing to reference (import) concrete textual syntaxes specified in different TCS specifications.

The modified TCS excerpt is depicted in Fig. 3.4. The imports form a tree with a designated root representing the concrete syntax of the main model element. Within this document, constructs can appear representing imported syntax.


Figure 3.4: MOF diagram of tcs.ConcreteSyntax after adding import

There are two possible designs for compositional syntax mappings:

• Import of a whole abstract-to-concrete syntax mapping including all syntax definitions defined in the mapping. This resembles the import declaration import mypackage.* in Java that imports all classes from a package.


• Import of a subset of constructs defined in a mapping: this requires checks that this subset is closed as defined in 3.3 (well-formedness).

The first choice is preferred since a goal of composing languages with the described framework is to reuse syntax specifications already written. This should not require white-box knowledge of their internal constructs (templates etc.). From a language designer's viewpoint, it should be possible to simply use an existing language construct. Since the relationship between the different templates of a syntax definition is implicitly defined by the referenced metamodel, it is essential to provide a white-box import mechanism. As opposed to importing Java classes, the interface of the imported definitions is not always known in advance.

The parsers resulting from each imported mapping are merely supposed to be plugged in. This refers to composition on the tool level. The work at hand focuses on the combination of lexical and syntactic analyzers and considers tool composition a subsequent task. We therefore investigate whether such functionality can be consolidated with a scannerless parsing technique of the underlying domain parsers.

Modification of Concrete Syntax

The common syntax for TCS mappings must provide a way to specify import declarations textually. A simple construct referencing imports before the actual syntax definition can be added as shown in Listing 3.1. Note that the additional template for ConcreteSyntax is needed in order to specify an import just by stating its (qualified) name.

template TCS::ConcreteSyntax main context
    : imports
      "syntax" name (isDefined(k) ? "(" "k" "=" k ")") "{" [
        templates
        (isDefined(keywords) ? "keywords" "{" [ keywords ] "}")
        (isDefined(symbols) ? "symbols" "{" [ symbols ] "}")
        operatorLists
        tokens
        (isDefined(lexer) ? "lexer" "=" lexer{as = stringSymbol} ";")
      ] {nbNL = 2} "}"
    ;

template TCS::ImportDeclaration
    : "import" concreteSyntax {refersTo = name}
    ;

Listing 3.1: Extract of the modified concrete syntax specification for TCS

The combination of modified abstract syntax (TCS metamodel) and concrete syntax (TCS mapping for TCS) can be processed by the bootstrapped TCS parser as outlined in the overview (3.1.2). The result will be a TCS parser recognizing syntax that contains compositional elements (import statements).

However, the crucial part of composing languages is creating parsers for the composed languages. The following section discusses the transformation from a TCS instance to a grammar, which is the basis for compositional parsers.


3.3 TCS-to-Grammar Transformation

For correctness of the transformation of a TCS instance into a parser, a formal specification of both the source and target languages involved is of great use, irrespective of its actual implementation as a model-to-model transformation (operational or relational) or written in a general-purpose programming language. In the following, particular elements of the source metamodel (TCS) are opposed to their transformed result (grammar element). For better readability, the textual representation of grammar elements is chosen.

The transformation task can be stated formally as follows: Let T be a TCS instance conforming to the metamodel M_TCS described in Sec. 2.2, specifying the concrete textual syntax of a DSL L with abstract syntax given by the metamodel M_DSL. Let G be the output grammar. Then G is expected to meet the following conditions:

• Well-formedness: G must be a valid Rats! grammar. The code generating facilities of Rats! must accept the grammar if the mapping is a closed specification of a concrete-to-abstract syntax mapping. Closed here means that templates are specified for all elements of M_DSL that are (directly or indirectly) referenced from the main model element. This is a weaker assumption than restricting oneself to complete syntax mappings. Complete would mean that concrete syntaxes need to be specified for all model elements. This relaxation is especially useful for large metamodels where only parts are to be edited by means of textual syntax.

Generally, a separate step testing the validity of the input model is desirable. Validation of T should thus be performed before the start of the transformation to guarantee proper output.

• Syntactic correctness of action code: If G contains action code, which is true for all non-trivial cases of T, this code must fit syntactically into the generated parser code. Although this might sound like a trivial requirement, snippets of code containing variable declarations and compound statements are likely to produce syntactically incorrect code. The correctness criterion must hold for all closed instances T.

• Adaptability: Without knowing details of M_DSL, an editor must be able to use the parser generated from G. Especially partial re-parsing of textual input needs to be supported by directly invoking parser methods. For this purpose, a nomenclature for metamodel elements is needed, stating an injective map

nom : M_DSL × modes → D

with D denoting the domain of legal names for Java identifiers and modes being the set of template modes used. With this name convention, lookup and reflective invocation of methods can be performed.

A designated method name for parsing the main element of a syntax must be associated with the rule originating from the TCS template with isMain set to true.
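A minimal sketch of how such a nomenclature and the reflective method lookup could work. The naming scheme ("::" replaced by "_", mode appended after "__") and all class names are assumptions of this example, not the framework's actual convention.

```java
import java.lang.reflect.Method;

final class Nomenclature {
    // nom: metamodel element x optional mode -> legal Java identifier;
    // injective as long as element names and modes contain no "__"
    static String nom(String metaElement, String mode) {
        String base = metaElement.replace("::", "_");
        return mode == null ? base : base + "__" + mode;
    }

    // reflective lookup and invocation of the rule method derived via nom
    static Object invokeRule(Object parser, String metaElement, String mode) {
        try {
            Method rule = parser.getClass().getMethod(nom(metaElement, mode));
            rule.setAccessible(true);
            return rule.invoke(parser);
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
    }
}

// stand-in for a parser generated from G, with one method per template rule
class GeneratedParser {
    public String TCS_ConcreteSyntax() { return "parsed ConcreteSyntax"; }
}
```

An editor could then trigger partial re-parsing for a given metamodel element without compile-time knowledge of the generated parser class.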


3.3.1 Concrete Syntax to Grammar

As explained in Ch. 2, a promise of the Rats! parser generator framework is its support for modularization of syntax specifications via modules that can be imported, extended and reused.

For the composition of concrete-to-abstract syntax mappings this is considered a valuable feature. Consequently, the transformation is designed to have one module per concrete syntax definition. The whole tree of definitions importing each other is captured by a Rats! grammar.

Input 3.3.1.1 (ConcreteSyntax). C with syntax imports I, set of template declarations T, set of keywords K, set of symbols S, list of operators O.

For future ease of reference, elements of T can be partitioned according to their types, with T_C denoting ClassTemplates, T_P denoting PrimitiveTemplates, T_O denoting OperatorTemplates, T_E denoting EnumerationTemplates and T_F denoting FunctionTemplates:

T = ⋃_{i ∈ {C,P,O,E,F}} T_i

Output 3.3.1.1 (Grammar). G with Modules for all syntax imports i ∈ I and ModuleModifications if the imports are not disjoint with respect to their sets of nonterminals. If no imports are present (leaves in the import tree), there is a one-to-one relation between ConcreteSyntax and Module without modifications.

Those modules contain productions resulting from the transformation of templates T (see 3.3.2), operators O (see 3.3.3), keywords K and symbols S (see 3.3.7).

Additionally, for model injection and observation of the parsing process, the modules' options need to be set to stateful. The detailed semantics of this keyword is left to Sec. 4.3.

3.3.2 Class Templates to Productions

A ClassTemplate states what the expected textual representation of a model element is. This may be a combination of lexical parts, such as references to symbols and keywords, or parts that depend on the element's structural features. Each metamodel element that is supposed to be edited with concrete textual syntax needs a corresponding specification by one (or more) class templates.

As can be seen in Fig. 2.5, many different cases of transforming class templates need to be considered (according to the assignments of the various attributes). This part is central for the TCS-to-grammar transformation.

Special cases to be considered include:

• abstract or abstract operatored with or without syntactic contribution


• main template

• modes allowing different syntax definitions for one meta class

• referenceOnly templates

• context tags attached to templates

Common to all class templates is the fact that their sequences need to appear on a right-hand side of a production so that a context-free parser generated from the grammar can expand the respective nonterminal to all syntax representing the model instance.

Input 3.3.2.1 (ClassTemplate non-abstract). t ∈ T_C with reference to Sequence s. Let t be non-abstract.

Output 3.3.2.1 (Production). p with its nonterminal name set to the value of the nomenclature map nom applied to the meta type for which t was specified and t.mode, and with pre- and post-actions A_pre and A_post. p's declared return type is Object to allow any model element to be created, and p is a stateful production to allow micro-transactions to observe the parsing process.

Visibility of p is public only if t.isMain equals true. Otherwise it is private. Thus, only the top-level language constructs can be parsed from outside.

The right-hand side of p will be the result of the transformation applied to all elements of s's SequenceElements.

Output 3.3.2.2 (Action). A_pre creating a model element proxy with the referenced meta type. This proxy will store all attribute values until it is passed to the model query engine in the post-action.

A_pre must pass the context information, i.e. the boolean flag t.isAddToContext and its optional context tags, to the proxy for resolution of references.

If t is a referenceOnly template, a dedicated reference proxy will be created within A_pre that never leads to model creation.

Output 3.3.2.3 (Action). A_post setting the final return value. This will always be the result of the delayed model creation or resolution (which can only be completed after parsing the entire template sequence).
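The pre-/post-action pair can be pictured with the following sketch (hypothetical names, standing in for the real proxy and model query engine): A_pre creates the proxy, parsing fills in attribute values, and A_post triggers the delayed creation or reference resolution.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class ModelElementProxy {
    private final String metaType;
    private final Map<String, Object> features = new LinkedHashMap<>();
    private final boolean referenceOnly;

    // A_pre: create the proxy for the referenced meta type
    ModelElementProxy(String metaType, boolean referenceOnly) {
        this.metaType = metaType;
        this.referenceOnly = referenceOnly;
    }

    // parsing the template sequence buffers attribute values in the proxy
    void set(String feature, Object value) { features.put(feature, value); }

    // A_post: delayed model creation, or reference resolution only
    Object resolve() {
        if (referenceOnly) return "ref->" + features.get("name");
        return metaType + features; // stands in for real model creation
    }
}
```

A referenceOnly proxy resolves to a lookup of an existing element instead of creating a new one, mirroring the distinction made above.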

For abstract class templates the transformation is slightly more complicated. This is due to the fact that in TCS it is possible to carry the notion of choice expressed by the is-a relationship in the abstract syntax via inheritance over to the textual syntax definition. Adding the keyword abstract to a class template is enough to specify that a template inherits the textual representation of a super type.

Furthermore, abstract class templates play a part in the context of operatored expressions. Adding the operatored keyword and an operator list to an abstract class template defines that the concrete textual syntax of an abstract model element is the combination of its sub-elements using the operators with the specified priorities and associativities. For the generation of a parser, however, the notion of choice (abstraction) and priorities (operators) needs to be encoded into the grammar via alternatives.

The following two input/output relations are for abstract [operatored] class templates:

Input 3.3.2.2 (ClassTemplate abstract). t_a ∈ T_C with reference to (possibly or likely empty) Sequence s_a. Let t_a now be abstract.

Output 3.3.2.4 (Production). pa’s head is very similar to the one in 3.3.2.1. But,Apre and Apost are much simpler (see below) and the production need not be statefulbecause model creation is done in the actual concrete subtemplates.

If s_a is the empty sequence, the right-hand side of p_a is a set A of alternatives referencing subtemplates (with t ⊳ L denoting that t specifies syntax for L):

A ⊇ {t ∈ T : ∃M, M′ ∈ M_DSL : M′ extends M ∧ t_a ⊳ M ∧ t ⊳ M′}

Output 3.3.2.5 (Production abstract contents). If s_a is not empty, there is an additional alternative in p_a for the abstract contents. This is a separate production of type non-abstract as detailed in 3.3.2.1.

Output 3.3.2.6 (Actions). A_pre is empty for abstract templates. A_post is only responsible for assigning the return value, which is the result of one of the alternatives in the sequence.

A prominent feature of TCS is its uncomplicated specification of operatored expressions. This is done via abstract operatored class templates that have a corresponding abstract syntax element and a list of operators with associated priorities. To recognize expressions of the operatored fashion, there is a need for a sequence of rules parsing structures from different priority levels, plus the rule responsible for parsing the actual concrete syntax for subtypes of the abstract syntax element (called primary rule in the following).


Consider arithmetic expressions consisting of positive integers connected by operators + and *, with multiplication having a higher priority than addition. The concrete syntax for such a language fragment might be expressed by an abstract template for Expression pointing to the two-element priority list containing * on level 0 and + on level 1.4 The transformation to a grammar will result in four productions:

• An entry rule for the abstract syntax element Expression following 3.3.2.2

• A primary rule for concrete subtypes parsing integer literals in the example

• A rule for additive structures (priority 1)

• A rule for multiplicative structures (priority 0)

These will be executed in the order entry → priority 1 → priority 0 → primary.
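A minimal hand-written sketch of these four rules (the generated parser additionally contains actions and memoization, which are omitted here); the output is a parenthesization that makes the recognized structure visible.

```java
final class ExprRules {
    private final String[] tokens;
    private int pos = 0;

    ExprRules(String input) { this.tokens = input.split(" "); }

    String entry() { return priority1(); }          // entry rule: lowest priority

    String priority1() {                            // additive (priority 1)
        String left = priority0();
        while (pos < tokens.length && tokens[pos].equals("+")) {
            pos++;
            left = "(" + left + " + " + priority0() + ")";
        }
        return left;
    }

    String priority0() {                            // multiplicative (priority 0)
        String left = primary();
        while (pos < tokens.length && tokens[pos].equals("*")) {
            pos++;
            left = "(" + left + " * " + primary() + ")";
        }
        return left;
    }

    String primary() { return tokens[pos++]; }      // integer literal
}
```

Parsing "1 + 2 * 3" yields "(1 + (2 * 3))": the additive rule delegates to the multiplicative one, which binds tighter, exactly in the order entry → priority 1 → priority 0 → primary.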

Structurally, the priority i rules are defined only by the operator list. The actual right-hand side, however, is determined by the existence of an OperatorTemplate for the respective operator. Only this operator template specifies which abstract syntax element, corresponding to the combination of elements with this operator, must be created or modified. In the above example, there might be a generic BinaryExp model element referencing a left- and right-hand side. A detailed discussion of the transformation of operator templates is postponed to Sec. 3.3.3.

Input 3.3.2.3 (ClassTemplate abstract operatored). Let t_a ∈ T_C now be abstract and operatored with operator list o_a containing m priority levels prio_0 through prio_{m−1}.

Output 3.3.2.7 (Production abstract operatored). p_a as entry rule: not stateful, only delegating to the rule for the lowest priority level, prio_{m−1}.

If t_a has an additional nonempty sequence, 3.3.2.5 applies.

Output 3.3.2.8 (Production priority k). For each priority level k = 0...m−1, a production p_prio_k is produced parsing syntax corresponding to priority level prio_k. See 3.3.3 for details of the right-hand side of these productions.

Output 3.3.2.9 (Production primary). p_prim having alternatives A referencing template rules for subtypes of M_a that are not marked with keyword nonprimary. So, similar to 3.3.2.4, A is:

A = {t ∈ T_C ∪ T_P ∪ T_E : ∃M_a, M′ ∈ M_DSL : M′ extends M_a ∧ t_a ⊳ M_a ∧ t ⊳ M′ ∧ (t ∈ T_C ⇒ ¬t.nonPrimary)}

4Priorities will be numbered from 0 (highest) ascending to m−1 (lowest). Note that this implies that the operator with the highest priority has the lowest index.


Output 3.3.2.10 (Actions). Pre- and post-actions of the entry rule and of the primary rule are identical to the abstract case without the operatored keyword: 3.3.2.6.

Actions for the priority k case are discussed in 3.3.3.

3.3.3 Operator Templates to Productions

While the structure of rules created from operatored templates can be inferred from an abstract operatored class template and the referenced list of operators, an OperatorTemplate is needed in order to specify how the constructs expressed with these operators are represented in abstract syntax. In the above arithmetic example (in 3.3.2), an abstract syntax element BinaryExp was suggested, referencing two (abstract) expressions as left- and right-hand side.

This illustrates the complication that arises when translating the abstract syntax to rules: while it is perfectly allowed to represent binary expressions of various types by one abstract model element (e.g. varying only in an operator attribute), the parser needs a hierarchical structure to set the references correctly. For that, not only the precedence of operators (via priorities) but also arity and associativity affect the rule creation:

Arity

The arity of an operator is its number of arguments. Usually only unary (n=1) and binary (n=2) operators are present. Ternary (n=3) expressions or operations of higher arity generally need two different operators, such as the ternary expression for an if-then-else statement in Java syntax:

T value; if (exprIF) value = expr1; else value = expr2;

written with the operators ? and : as a single statement

T value = exprIF ? expr1 : expr2;

Associativity

The associativity of binary operators states how parentheses can be rearranged, i.e. whether op(op(x, y), z) = op(x, op(y, z)). Non-associative operators must be bracketed or specified as left- or right-associative:

• Left-associativity: x op y op z := (x op y) op z ∀x, y, z

• Right-associativity: x op y op z := x op (y op z) ∀x, y, z

Subtraction is a well-known example of a left-associative operation, i.e. 8 − 2 − 2 = (8 − 2) − 2 = 4. In contrast, exponentiation is right-associative: a^b^c = a^(b^c).

Consider the case of only one binary left-associative operator ◦ connecting variables denoted by letters. The abstract operatored expression then calls the level-0 rule priority 0, which is responsible for parsing all elements on that priority level.

An input such as a ◦ b ◦ c must be parsed by three rules:


• rule priority 0 for expressions on level 0 (which can be the entire input in the example)

• rule primary for the literals (not discussed in further detail)

• rule binary as operator rule for the connection of expressions with ◦.

Associativity comes into play when connecting the different rules. In both cases, rule binary is responsible for creating the model element that associates the left and right element of the binary operation. In the left-associative case, this must result in an element having a as left and b as right side, and another element having this compound as left and c as right side. Consequently, rule binary must parse only the next construct on the same priority level and set it as the right side. This yields the parse (a ◦ b) ◦ c.

In the right-associative case, rule binary gets a as the left side and its right side can be an arbitrary expression (another binary b ◦ c in the example). This yields the parse a ◦ (b ◦ c).
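The difference can be made concrete in a small sketch, with "#" standing in for the example's binary operator: the left-associative variant lets the rule take only the next primary as its right side and loops on the same level, while the right-associative variant recurses into the whole level for its right side.

```java
final class AssocRules {
    private final String[] toks;
    private int pos = 0;
    private final boolean leftAssoc;

    AssocRules(String input, boolean leftAssoc) {
        this.toks = input.split(" ");
        this.leftAssoc = leftAssoc;
    }

    String parse() {
        String left = toks[pos++];                 // primary: a single letter
        if (leftAssoc) {
            // right side is only the next primary; the loop accumulates leftward
            while (pos < toks.length && toks[pos].equals("#")) {
                pos++;
                left = "(" + left + " # " + toks[pos++] + ")";
            }
            return left;
        }
        // right side is an arbitrary expression of the same level (recursion)
        if (pos < toks.length && toks[pos].equals("#")) {
            pos++;
            return "(" + left + " # " + parse() + ")";
        }
        return left;
    }
}
```

For the input "a # b # c" the left-associative variant yields "((a # b) # c)" and the right-associative one "(a # (b # c))", matching the two parses described above.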

Prefix vs. Infix vs. Postfix

The usual notation for binary operations is infix, i.e. operand1 followed by the operator followed by operand2. In TCS, there is a way to specify an operator as postfix, leading to notations of the form operand1 followed by operand2 followed by the operator.

While infix is most common for binary operators, unary operators are commonly denoted both prefix and postfix. Consider the two variants of the increment operator, ++i and i++, in Java. Transforming a unary prefix operator and the according operator template to a grammar leads to some interesting aspects. Usually, the rule for the binary operator template is called when the parser reaches the operator (after parsing the first operand), and the left-hand side is passed to the operator template, which can create or update the according model element. Unary prefix operators must be parsed, however, before the operand is parsed.

These observations can be stated more formally in the following relation. Note that the complication of transforming operator templates arises from the need to bring information from different parts of the syntax definition together in one rule.

Input 3.3.3.1 (OperatorTemplate). t_o that associates a classifier with a set of operators O from the universe of declared operators 𝒪, i.e. O ⊆ 𝒪.

Let t_o have a sequence s_o and point to structural features of metamodel type M_L (as left-hand side) and M_R (optionally, as right-hand side) with M_L, M_R ∈ M_DSL.

Output 3.3.3.1 (Production priority k). p_prio_k on level k results from the operators on priority level k in an operator list. Its right-hand side is a call to the rule of the next level, p_prio_(k−1) (or p_prim if k = 0), and a repeated element with j alternatives, j being the number of operators on priority level k.


The alternatives each consist of the operator literal, followed by an action A_psh, followed by a call to the rule representing an operator template t_o for it (optionally followed by a call to a rule t_ass parsing the right side of the operation).

Depending on the associativity of the operator, t_ass is either p_prio_k (right-associative) or p_prio_(k−1), p_prim respectively (left-associative).

Output 3.3.3.2 (Action PSH, POP). A_psh and A_pop push and pop model references onto/from a stack to be processed by the operator template rule.

Output 3.3.3.3 (Production). A production for operator template t_o. Its right-hand side is the transformation of sequence s_o. Model-creating pre- and post-actions according to 3.3.2.2 and 3.3.2.3 set the references to the left operand (via A_pop) and optionally to the right operand.

So the idea of the operator templates is to collect all information that is needed in order to create the model element representing the expression. By passing references to model elements on the left side of an operator to the operator template rule (which has the information where to store the left and right side), the model element's properties are collected piece by piece.
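A sketch of this stack-based hand-over (all names illustrative, not the framework's API): the priority rule pushes the already parsed left operand (A_psh), and the operator-template rule pops it (A_pop) and combines it with the operator and the right operand into a BinaryExp-like element.

```java
import java.util.ArrayDeque;
import java.util.Deque;

final class OperatorActions {
    private final Deque<Object> operands = new ArrayDeque<>();

    // A_psh: performed by the priority rule before calling the operator rule
    void push(Object leftOperand) { operands.push(leftOperand); }

    // operator-template rule: pops the left side (A_pop) and sets both references
    Object binaryExp(String op, Object right) {
        Object left = operands.pop();
        return "BinaryExp(" + left + " " + op + " " + right + ")";
    }
}
```

Pushing intermediate results back onto the stack lets nested expressions be assembled piece by piece, as described above.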

For left-recursive constructs, specifying operator templates is the only way to implement them in a fashion that can be accepted by LL-type parser generators. This is why they are usually employed for more than just typical operatored expressions, which hints at the importance of this part of the transformation.

3.3.4 FunctionTemplate to Production

Function templates provide concrete syntax for parts of model elements. The idea is to specify concrete syntax for features of a model element that are common to many (or all) of its subtypes. By simply calling the function, the function sequence is executed as if it were pasted into the class template's sequence.

For the transformation, this makes it easy to specify the desired input-output relation.

Input 3.3.4.1 (FunctionTemplate). Let tf be the function template to transform, with sequence sf .

Output 3.3.4.1 (Production). The result is pf , returning Object, with a name identical to f.functionName. Its right-hand side is the result of the transformation applied to sf . Details are discussed in 3.3.6.

Note: If accessed from a stack, the appropriate model proxy need not be passed through a rule parameter.


3.3.5 EnumerationTemplate to Production

Enumeration templates can be translated to Rats! grammars very easily. An enumeration's literals are alternatives and represent only string literals. While the automatic mode derives the string literals directly from the enum literals, the language designer can also specify the string literals representing the enum literals explicitly.

This leads to the following relation:

Input 3.3.5.1 (EnumerationTemplate). Let te be the enumeration template to transform. Automatically or manually, a set SL of string literals can be derived from te.

Output 3.3.5.1 (Production). The result is pe, returning Object, with a name derived from te by means of nomenclature nom. Its right-hand side is a set of alternatives (= ordered choice), one for each element sl ∈ SL, of the form

(Aenter, sl, Aexit, Apost)

with injector actions Aenter and Aexit as in 3.3.6.1 and Apost returning an enum literal with the specified string representation.
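As a minimal illustration of the input-output behaviour of such a production, consider a hypothetical Visibility enumeration (none of these names stem from the thesis): the production acts as an ordered choice over the string literals in SL, and the post action returns the matching enum literal.

```java
public class EnumTemplateSketch {
    // Hypothetical metamodel enumeration with two literals.
    enum Visibility { PUBLIC, PRIVATE }

    // The production p_e as an ordered choice: try each string literal sl in SL
    // in order; A_post returns the enum literal for the recognized string.
    static Visibility parseVisibility(String input) {
        if (input.equals("+")) return Visibility.PUBLIC;  // alternative for "+"
        if (input.equals("-")) return Visibility.PRIVATE; // alternative for "-"
        throw new IllegalArgumentException("no alternative matched: " + input);
    }
}
```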

3.3.6 tcs.Sequence to xtc.Sequence

A more intuitive part of the transformation concerns the TCS element Sequence. Everything that is on the right-hand side of a template is a sequence consisting of 0..* SequenceElements.

As can be seen in Fig. 2.8, concrete instances of SequenceElement can be of various types. Some of them are more related to the structure of the concrete syntax, and their grammar representation is relatively straightforward (Block or LiteralRef). Others require inspection of the underlying metamodel MDSL (Property and FunctionCall) or special actions for the parsing observer (InjectorActionsBlock or Alternative).

A general pattern of sequences is nesting: Fig. 2.6 shows that three sequence element subtypes can be (or have) sequences themselves. For example, a sequence can have 0..n blocks, which themselves contain one or more sequences. This leads to a cycle in the diagram representing the nested structure.

Input 3.3.6.1 (SequenceElement, abstract). Let s be a sequence element of a type not further specified, appearing within a template t.

Output 3.3.6.1 (xtc.parser.Sequence). Regardless of whether s is atomic or nested, an xtc.Sequence sxtc results with sxtc = (Aenter, s′, Aexit). Actions Aenter and Aexit are notifications to the parsing observer; s′ is the transformed result of the concrete subtype of SequenceElement (see below).


If s is an instance of Alternative, Block or ConditionalElement, there is also a nested sequence to be transformed (conditionals have two references, which can, however, be regarded as two alternatives, being a nested structure again).

Input 3.3.6.2 (SequenceElement, nested). Let s be of a nested element type and sn its nested sequence.

Output 3.3.6.2 (xtc.parser.Sequence). The result of the transformation τ of a nested sequence element is a sequence sxtc containing all nested elements concatenated (denoted by ⊙):

τ(s) = sxtc = ⊙ τ(s′) ∀ s′ ∈ sn

Additionally, Blocks need parentheses surrounding the sequence and (optionally) line breaks. Alternatives require separation of their elements by "/" for an ordered choice. Conditionals always represent optional elements, producing two alternatives (one of which is empty).
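The recursive concatenation can be sketched as follows, with made-up element types standing in for the TCS and xtc model elements (the real transformation operates on model instances, not strings):

```java
import java.util.List;

public class SequenceTransformSketch {
    interface Element {}
    static final class Literal implements Element {
        final String text;
        Literal(String text) { this.text = text; }
    }
    static final class Block implements Element {
        final List<Element> nested;
        Block(List<Element> nested) { this.nested = nested; }
    }

    // tau(s): a literal is transformed directly; a block concatenates the
    // transformed results of all its nested elements and adds the
    // surrounding parentheses required for blocks.
    static String transform(Element e) {
        if (e instanceof Literal) return ((Literal) e).text;
        StringBuilder sb = new StringBuilder("(");
        for (Element child : ((Block) e).nested) sb.append(" ").append(transform(child));
        return sb.append(" )").toString();
    }
}
```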

While these definitions are recursive, atomic sequence elements such as Property, LiteralRef, FunctionCall and InjectorActionsBlock can be transformed directly.

Property

Probably the most important sequence element in TCS is Property. It refers to a structural feature of a model element. If a property appears as part of a sequence, the property's syntax is expected at that location. From its meta type and the syntax lookup, the transformation can infer the rule to be called and whether the part is required, repeatable or optional. The general pattern for the transformed result (textual notation for clarity) is always

(temp_i:template_rule_name { setProperty(...) })* or
(temp_i:template_rule_name { setRef(...) })*

However, property arguments impede an entirely straightforward solution of just calling the rule representing the appropriate template.

Most nontrivial situations arising from additional property arguments need handling by the model injecting facility. But some are related to the syntax and parser generation and are hence discussed in the following:

• AsPArg: instead of the template inferred from the meta lookup, the specified primitive template must be called.

• ModePArg: modes added to a property enforce execution of a specific variant of a class template. The transformation needs to call the appropriate rule. Therefore, the required nomenclature defined in 3.3 takes a mode as second parameter to uniquely identify the rule.


• SeparatorPArg: if a separator is specified for multivalued properties, it has to be added to the sequence that forms the repeatable element.

• ForcedLowerPArg and ForcedUpperPArg: multiplicities of features that are overridden in the syntax specification are supposed to ensure a minimum (maximum) number of elements. Since this cannot be mapped elegantly5 to a grammar supporting only repeated (*) and optional (?) elements, the TCS feature is ignored.

Input 3.3.6.3 (Property). Given a property prop with property arguments prop.mode and prop.as referencing primitive template tP ∈ TP . Let MP be the type of the structural feature and prop.sep be the optional separator argument.

Output 3.3.6.3 (Quantification). A quantified element qprop which, in turn, contains two required elements: a Binding bprop and an injector action Ainj. Optionally (depending on prop.sep) there is a third element: a Terminal containing the separator.

bprop is bound to the semantic value of a nonterminal resulting from the syntax lookup of prop's template, uniquely identified by

nom(MP , prop.mode)

or which is simply tP .

Ainj is not detailed here since it requires a discussion of all other property arguments such as createAs, refersTo, query etc.

InjectorActionsBlock

As depicted in Fig. 2.8, an InjectorActionsBlock contains 0..* PropertyInits. Transforming them does not concern the syntax analyzer, as it affects only how model elements are created or updated.

Thus, details of the TCS injection mechanisms are not discussed. Code generated from the transformation of PropertyInits can be inserted similarly as in the ANTLR version (in curly braces). There are, however, technical consequences that arise from the backtracking nature of all Rats!-generated parsers. These are covered in the implementation Ch. 4.

FunctionCall

A function call always references its function template. Given a correct syntax definition, the grammar must already contain a transformed result of that template. Calling this function works by merely inserting the nonterminal associated with the transformed template. When implemented with a model proxy stack, no passing of arguments is needed.

5 Apart from unfolding the forcedLower argument n to a sequence of exactly n elements plus a repetition.


LiteralRef

String literals appearing as part of a sequence can be trivially transformed to grammar elements. Since they represent keywords in the language definition, their transformation is detailed in 3.3.7.

3.3.7 Keywords and Symbols to Productions

Keywords and symbols are explicit specifications for what is to be considered an immutable unit of lexical syntax. The difference between keyword and symbol is blurred by the fact that for a literal reference consisting of more than one character it is hard to guess whether the quoted literal is considered a keyword or a symbol.6

The design for the transformation of keywords and symbols to grammar rules needs to take into account that Rats! does not produce tokens, which are, however, very useful for error reporting and editor features such as code completion. Usually, token information in TCS is specified by the following elements:

• literal references appearing as sequence elements are automatically considered keywords

• symbols and keywords may be specified additionally in the corresponding TCS section

• lexer code can be provided in the lexer section, overriding the default lexer implementation of ANTLR

Building tokens on the fly, i.e. during the parsing process, requires a dedicated rule for each different token that can be observed via the micro-transactions detailed in Sec. 4.3.

Formally stated, we need the following relation:

Input 3.3.7.1 (Keywords and Symbols). The set of keywords K can be separated into sets of explicitly defined (Kdef) and referenced (Kref) keywords: K = Kdef ∪ Kref . These sets need not be disjoint.

The same is true for symbols (let S, Sdef and Sref be the corresponding symbol sets).

Output 3.3.7.1 (Literal Productions). For each k ∈ K and for each s ∈ S, a production p results, returning a String representing the literal. The production must be stateful for transactional handling and transient to suppress memoization.

An init-action is needed in order to communicate to the parsing state observer that committing this transaction leads to the creation of a new token.

Note: details of both production attributes (stateful and transient) and their effect on the transformation are described in the implementation Ch. 4.

6 A possible solution for that is assuming that symbols never consist of alphanumeric characters, which is true for most languages.


4. Implementation

To evaluate the feasibility of language composition, technical issues arising from employing a scannerless parser generator need to be discussed. For this purpose, a prototypical implementation of the transformation exhibited in Ch. 3 was developed. The existing textual editing framework was modified in order to use the Rats! parser generator instead of ANTLR. When migrating to the scannerless parser technology, the following specific questions have been of interest:

• Given the differing paradigm, is it possible to implement a transformation from TCS to Rats! grammars such that for all possible mapping specifications and abstract syntax models a domain parser results?

• How can pre- and post-actions like the ones specified in Sec. 3.3 be added as pure Java code to permit injections (model creations) during the parse?

• Does the backtracking nature of every Rats!-generated parser pose a problem when creating model elements or tokens?

• How does migrating to a token-free environment affect error reporting and error recovery?

• What are the obstacles when striving for incrementality (with respect to both lexing and parsing, since they are to be integrated)?

• What technical requirements are placed on the generated parsers considering the integration into the textual editing framework?

Most of the above questions could be dealt with when implementing the prototype. The bottom line is that the designed transformation can be realized with features provided by Rats!, such as nested transactions (see Sec. 4.3), and a combination of a unique naming function for variable bindings (see Sec. 4.2.3) with a heuristic (see Sec. 4.4.3) ordering the alternatives appearing on a right-hand side.

A remaining unresolved issue is the question how an incremental version of the parser can be integrated into the editing framework. It is doubtful that small modifications to the generated tokenizing parser are sufficient to support incrementality with respect to both parsing and lexing. We consider this a separate topic which goes beyond the scope of this work. Requirements for a solution are discussed in Sec. 4.6.

4.1 Handler-Based Transformation

As pointed out in the design chapter (Sec. 3.1.2), the transformation from TCS to Rats! could be implemented by a model-to-model transformation or coded in Java. The prototypical implementation uses specialized handlers for the most important Rats! metamodel elements to create a Rats! grammar instance.

Figure 4.1: Handler-based transformation of TCS instances

Implementing the transformation with special purpose classes for the main model elements was chosen for the following reasons:

• Testing: essential for the correctness of the whole transformation is the correctness of each transformed element. JUnit tests can be derived from the different handlers and the expected results.

• M2M engine: using a model-to-model transformation brings in another external library with potentially low-performing transformation engines and additional dependencies. The author's experience with relational QVT, for example, suggests that in complex transformations a clean and concise transformation specification is hard to develop (see [Kus08]).

• Concept Reuse: from the ANTLR-based framework, a significant amount of code is similar to the code needed for the transformation to Rats!.

4.2 Packrat Parser Specifics

Some details of the code generator provided by the Rats! parser generator framework need to be considered in order to produce correct domain parsers from the generated grammar. The most important and specific detail of the code generation is the fact that all generated parsers use memoization. All created parsers are backtracking, which usually implies exponential time complexity. The idea of packrat parsers is to store intermediate parsing results, called memoization, in order to guarantee linear-time complexity at the cost of additional space consumption. Details affecting the implementation are discussed in Sec. 4.2.1.

Another critical point during code generation may be the various parser optimizations that Rats! employs. As opposed to memoization, these can, however, be deactivated. Additionally, they are mostly related to literal and transient productions. See Sec. 4.2.2.

Action code must be added directly to the output grammar for model-creating actions and calls to the parsing observer. The prototypical implementation must make assumptions about how action code is inserted into the generated parser code. Issues resulting from those assumptions are detailed in Sec. 4.2.3.

4.2.1 Memoization

Common to all backtracking parsers with unlimited lookahead is an exponential worst-case complexity. Obviously this is unacceptable and must be taken care of. Packrat parsers use memoization to avoid re-parsing of input already processed. With additional storage overhead for the intermediate results, packrat parsers exhibit linear time complexity and are thus of practical relevance and applicable to larger inputs, too.

Technically, memoization is implemented by a lookup table (referred to as the memoization table). For dedicated nonterminals, the table stores the result obtained from invoking the nonterminal's method at a specified index, i.e. such a table is a map of the form Index → Result. If the result is present at the specified index, it is retrieved from the table. If none is present, the method to parse the construct is invoked and the table is filled.
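The Index → Result lookup can be sketched as a toy memoizing wrapper. This is not the Rats!-generated code; the names and the trivial "parse" method are illustrative only:

```java
import java.util.HashMap;
import java.util.Map;

public class MemoSketch {
    static int invocations = 0; // counts actual parse attempts, for demonstration
    // One memoization table per nonterminal: Index -> Result.
    static final Map<Integer, String> table = new HashMap<>();

    // Memoizing entry point: return the stored result if present,
    // otherwise invoke the parsing method and fill the table.
    static String parseAt(String input, int index) {
        return table.computeIfAbsent(index, i -> doParse(input, i));
    }

    // The actual parsing method, now invoked at most once per index.
    private static String doParse(String input, int index) {
        invocations++;
        return input.substring(index, Math.min(index + 3, input.length()));
    }
}
```

Calling parseAt at the same index twice performs the parse only once; the second call is answered from the table, which is exactly why the transactional methods discussed next are bypassed on re-parses.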

As shown in detail in Sec. 4.3 and 4.5.3, transactional handling of methods is used for model injection and token creation. In a way, memoization violates the transactional contract that the start, commit and abort methods are invoked whenever a nonterminal is parsed. This is illustrated in Listings 4.2 and 4.3, which show the general pattern of memoized nonterminal productions. Consider the following TCS extract for EnumLiteralMappings (chosen because of their syntactical simplicity):

template TCS::EnumLiteralMapping
  : literal "=" element
  ;

Listing 4.1: Memoization example - TCS mapping snippet

It states that an EnumLiteralMapping is represented in concrete syntax as a pair of the EnumLiteral literal and the associated SequenceElement element, separated by an equals symbol. In the produced Rats! grammar, one rule (tcs_enumliteralmapping) is created as follows:

stateful Object tcs_enumliteralmapping =
    // pre action (create model proxy)

    temp_1:tcs_enumliteralval { set(proxy, "literal", temp_1); } Spacing EQ Spacing
    temp_2:tcs_sequenceelement { set(proxy, "element", temp_2); } Spacing

    { yyValue = commitCreation(proxy, null, false); }
    ;

Listing 4.2: Memoization example - simplified generated grammar

The nonterminal tcs_enumliteralmapping appears repeatedly throughout the whole syntax specification, which causes the code generator to create two methods: a virtual rule tcs_enumliteralmapping and the actual parsing rule tcs_enumliteralmapping$1. When performing the first parse of the nonterminal tcs_enumliteralmapping at a given index, the state-modifying methods start(), and eventually commit() or abort(), are invoked following the transactional paradigm. If the same nonterminal is parsed again (because an alternative in a calling rule aborted), the parsed result is retrieved directly from the memoization table (yyColumn.chunk1.ftcs_enumliteralmapping), bypassing the transactional methods.

private Result ptcs_enumliteralmapping(final int yyStart) throws IOException {
  TCSColumn yyColumn = (TCSColumn)column(yyStart);
  if (null == yyColumn.chunk1) yyColumn.chunk1 = new Chunk1();
  if (null == yyColumn.chunk1.ftcs_enumliteralmapping)
    yyColumn.chunk1.ftcs_enumliteralmapping
      = ptcs_enumliteralmapping$1(yyStart);
  return yyColumn.chunk1.ftcs_enumliteralmapping;
}

private Result ptcs_enumliteralmapping$1(final int yyStart) throws IOException {
  Result yyResult;
  Object yyValue;

  yyState.start();

  // PRE ACTION - create model proxy etc.

  yyResult = ptcs_enumliteralval(yyStart);
  if (yyResult.hasValue()) {
    Object temp_1 = yyResult.semanticValue();
    set(proxy, "literal", temp_1);

    yyResult = pSpacing(yyResult.index);
    if (yyResult.hasValue()) {
      yyResult = pEQ(yyResult.index);
      if (yyResult.hasValue()) {
        yyResult = pSpacing(yyResult.index);
        if (yyResult.hasValue()) {
          yyResult = ptcs_sequenceelement(yyResult.index);
          if (yyResult.hasValue()) {
            Object temp_2 = yyResult.semanticValue();
            set(proxy, "element", temp_2);
            yyResult = pSpacing(yyResult.index);
            if (yyResult.hasValue()) {
              yyValue = commitCreation(proxy, null, false);

              yyState.commit();

              return yyResult.createValue(yyValue);
            }
          }
        }
      }
    }
  }
  yyState.abort();
}

Listing 4.3: Memoization example - simplified parser code

For that reason, tokenization in the presence of memoization is more complex, and special post-processing must be carried out. Fortunately, for the model injection code this behavior is acceptable due to the following observation:

Lemma 4.1 (Memoization). Model elements are created correctly even in the presence of memoization.

Without a formal proof, the lemma is motivated in the following: Let prod be the stateful, memoized template production. If prod is called only once, there is no difference to a non-memoized parse. So the first call of prod can have succeeded (implying a proxy object in the memoization table) or failed (implying a null object in the table).

By context-freeness and the observation that references are set outside any transactional operations, we can conclude that in either case, retrieving the object from the memoization table will lead to the same result as parsing the syntax again.

4.2.2 Parser Optimizations

The Rats! parser generator includes a number of optimizations. Fig. 4.2 lists the available options. Most of them are tuned to increase throughput; the most important are Chunks, Transient, Repeated and GNodes. Reduced heap utilization is a second major benefit, achieved mainly by the options Chunks and Transient.

For the generation of domain parsers with high performance, optimizations play a significant role. Still, for a deterministic result of the transformation, some of the optimizations can cause problems. For the prototypical implementation discussed here, we chose to switch off all optimizations in order to guarantee that code fragments inserted into the grammar code appear at the right spot within the generated parser code. Especially the factorization of common prefixes can lead to syntax errors in the generated code and was therefore deactivated.

See Sec. 4.2.3 for typical errors resulting from action code.


Name          Description

Chunks        Organize memoized fields into chunks.
Grammar       Fold duplicate productions and eliminate dead productions.
Terminals     Optimize recognition of terminals, incl. using switch statements.
Cost          Perform cost-based inlining.
Transient     Do not memoize transient productions.
Nontransient  Automatically recognize productions as transient.
Repeated      Do not desugar transient repetitions.
Left          Implement direct left-recursions as repetitions, not recursions.
Optional      Do not desugar options.
Choices1      Inline transient void and text-only productions into choices.
Choices2      Inline productions that are marked inline into choices.
Errors        Avoid creating parse errors for embedded expressions.
Select        Avoid accessor for tracking most specific parse error.
Values        Avoid creating duplicate semantic values.
Matches       Avoid accessor for string matches.
Prefixes      Fold common prefixes.
GNodes        Specialize generic nodes with a small number of children.

Figure 4.2: Rats!’ parser optimizations (from [Gri06])

4.2.3 Actions and Bindings

Rats! offers a convenient way to put action code into a grammar. The injected code can consist of arbitrary Java code snippets and is not subject to any syntactic restrictions. That is why the transformation engine needs to take care that the combined parser code is valid Java code. Two items are especially critical:

• Duplicate local variables: bindings to grammar elements, i.e. the semantic value of an executed parser method assigned to local variables, may result in duplicate declarations. This is due to the fact that when stating a binding like temp:seqElem1, Java code is generated assigning the return value of the method pSeqElem1 to a newly declared variable temp.

In a TCS sequence, numerous sequence elements are parsed one after another, leading to duplicate local variables when the identifier name is not unique. Therefore, the implementation always creates bindings subscripted with ascending integers (per sequence).

• Undeclared variables: rules created from TCS operator templates are usually responsible for parsing the operator symbol, which is, however, associated with the calling rule created from an abstract operatored class template (see 3.3.3 for details). Passing the operator symbol back to the class template rule can be done by accessing a global variable opSymbol. But there is no guarantee that this variable lies within the same scope as the code where the symbol is bound. Memoized chunks and factorized method parts may complicate the situation.
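The per-sequence numbering scheme mentioned above amounts to a trivial counter; a sketch (class name hypothetical):

```java
public class BindingNamer {
    private int counter = 0; // a fresh instance is used per transformed sequence

    // Produce temp_1, temp_2, ... so that the local variables generated for
    // bindings never collide within one parser method.
    String nextBinding() { return "temp_" + (++counter); }
}
```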


4.2.4 Parameterized Rules

Context-freeness of the languages discussed here usually implies that a produced Rats! rule is also context-free in the sense that it does not require information from outside the scope of the rule. This is not completely true in the case of OperatorTemplates. The general pattern for operator templates is that the model is built up from model elements that are created before or after parsing the syntax for the operator. By passing the result of the previous sequence elements (primaries, to be specific) to the operator rule, it is possible to build the model correctly, respecting both arity and associativity.

In contrast to ANTLR, however, Rats! does not allow parameterized rules. So the implementation uses a stack holding the model element usually passed in. As outlined in Sec. 3.3.3, the push and pop operations surround the call to the respective operator rule, allowing the model for the compound expression to be built gradually.

4.3 Lightweight Nested Transactions

Productions of a Rats! grammar can be tagged with the attribute stateful, allowing to specify functionality associated with the success or failure of a rule. The operations performed upon success or failure of a production can be nested in the sense that a transaction can start other transactions before finally ending (successfully or unsuccessfully).

In combination with the code that is injected by the engine transforming each TCS element into a grammar item, transactional operations are crucial to the functioning of the outlined framework. Typically, starting a transaction may lead to the construction of model element proxies, while successful completion can imply their resolution to actual model elements. Another example of behavior associated with the completion of rules is the creation of tokens (discussed in detail in Sec. 4.5.3).

The semantics of stateful Rats! productions is as follows. Consider the following rule:

stateful Object C = A / B / "text";

The nonterminal C can be expanded to A, to B or to the string literal, alternatively. The keyword stateful indicates that the specified implementation of the State interface (as depicted in Fig. 4.3) observes the parsing process. That is:

• start() is called before C is expanded.

• commit() is called after the first successful parse of a C alternative. Here, nesting of transactions comes into play: before a commit operation is called, one of the alternatives has to have completed successfully.

• abort() is called when all alternatives of C have failed.
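The nested start/commit/abort behaviour, including rollback of tokens created by failed subrules, can be sketched with a stack of token lists. The class below is a hypothetical stand-in for that idea, not the actual InjectorState implementation:

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

public class TxSketch {
    // One token list per open transaction; the bottom entry collects
    // the tokens of the completed parse.
    private final Deque<List<String>> tokenStack = new ArrayDeque<>();

    TxSketch() { tokenStack.push(new ArrayList<>()); }

    void start() { tokenStack.push(new ArrayList<>()); }  // open a nesting level

    void newToken(String t) { tokenStack.peek().add(t); } // record a token in the current level

    void commit() {                                       // keep tokens: merge into enclosing level
        List<String> done = tokenStack.pop();
        tokenStack.peek().addAll(done);
    }

    void abort() { tokenStack.pop(); }                    // roll back all tokens of the failed rule

    List<String> tokens() { return tokenStack.peek(); }
}
```

A committed level's tokens survive, while an aborted level's tokens are discarded wholesale, which mirrors the rollback requirement for lexical rules described below.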

An important fact is that the transactional methods are parameterless. So the State interface observing the parse does not know which rule starts, commits or aborts. In two cases the generated parsers are transactional: rules creating proxies (i.e. generated from tcs.ClassTemplate) need to be stateful for correct creation and deletion of proxies and resolution of references. The other type of rules that are transactional by default are lexical (= tokenizing) rules. When a lexical rule aborts because a token cannot be recognized completely, all tokens created by its subrules must be rolled back. This can only be achieved in the abort method.

[Figure: class diagram showing the State interface (start, abort, commit, reset) and its implementation InjectorState with the fields tokenLookup, currentToken, tokens, tokenStack, isTemplateRule, isTokenizing and proxyStack, related to ParserBase, RatsObservableInjectingParser, IParsingObserver and IModelInjector.]

Figure 4.3: MOF class diagram of xtc.parser.InjectorState

The implemented InjectorState (Fig. 4.3) therefore has additional flags (isTemplateRule, isTokenizing) indicating whether a rule produces model proxies and/or tokens. These values can only be set correctly outside the transactional methods. Corresponding actions are inserted at the beginning of rules generated from templates or literals.

Summing up, the state-modifying transactions provided by the Rats! parser generator facility are essential for the implementation of a model-injecting domain parser. Still, some customization needed to be implemented to allow for handling different types of transactional rules.

4.4 Ambiguities

Every grammar containing ambiguous constructs suffers from the problem that no parser generated from it will be able to decide which is the right derivation for a given input. A prominent characteristic of all parsers relying on parsing expression grammars (PEGs) is that the alternatives are ordered (see 4.4.2 for implementational consequences) and that they hence avoid ambiguities. This sounds intriguing, but the result is that the author of a PEG has to consider the ordering of alternatives carefully to define the correct parse. Ambiguity can be illustrated with the well-known dangling-else problem. Without any sentinel identifying the beginning and the end of a statement (usually curly braces), nested if-then-else statements of the following form are ambiguous:

if Exp1 then if Exp2 then Stmt2 else Stmt′2   (innermost reading: the else binds to the second if)

It is unclear whether the else-clause belongs to the first or the second if-statement, resulting in two different parse trees. Grammars for languages like C or Java arbitrarily decree that the else clause always belongs to the innermost if-statement. This behavior is captured in the grammar by distinguishing open and closed statements; such precedence resolves the ambiguity.

In Rats!, the dangling-else problem cannot be solved more precisely. However, the ordering of alternatives makes it easier to define precedence rules within a statement. With the same precedence of innermost else clauses, a statement of the form

if Exp1 then if Exp2 then if Exp3 then Stmt3 else Stmt′3

cannot be nested as follows without explicit braces:

if Exp1 then if Exp2 then (if Exp3 then Stmt3) else Stmt2   (the parenthesized inner if as one statement, the else bound to the second if)

As can be seen from the above examples, the author of a Rats! grammar needs to take care of precedence in order to obtain the desired parse. Since in our case the grammar is generated from a mapping definition, the transformation is in charge of establishing the correct order of alternatives (see 4.4.3) or providing other means to avoid wrong derivations (see 4.4.1).
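To make the ordered-choice behaviour concrete, here is a tiny hand-written PEG-style recognizer for the dangling-else shape. Single characters stand for tokens (i = if-condition, T = then, E = else, s = plain statement); the code is entirely hypothetical and not generated by the transformation. Trying the if-then-else alternative before if-then makes the else bind to the innermost if:

```java
public class OrderedChoiceSketch {
    // Returns the position after a successfully recognized statement,
    // or -1 on failure. Ordered grammar: Stmt <- iT Stmt E Stmt / iT Stmt / s
    static int stmt(String in, int pos) {
        int p = ifThenElse(in, pos);    // first alternative: if-then-else
        if (p >= 0) return p;
        p = ifThen(in, pos);            // second alternative: if-then
        if (p >= 0) return p;
        return match(in, pos, 's');     // finally: a plain statement
    }

    static int ifThenElse(String in, int pos) {
        int p = match(in, pos, 'i'); if (p < 0) return -1;
        p = match(in, p, 'T');       if (p < 0) return -1;
        p = stmt(in, p);             if (p < 0) return -1;
        p = match(in, p, 'E');       if (p < 0) return -1;
        return stmt(in, p);
    }

    static int ifThen(String in, int pos) {
        int p = match(in, pos, 'i'); if (p < 0) return -1;
        p = match(in, p, 'T');       if (p < 0) return -1;
        return stmt(in, p);
    }

    static int match(String in, int pos, char c) {
        return pos < in.length() && in.charAt(pos) == c ? pos + 1 : -1;
    }
}
```

For the input iTiTsEs (an if nested inside an if, with one else), the whole input is recognized because the inner statement consumes the else, i.e. the else is bound to the innermost if.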

4.4.1 Greedy Parse - Shift/Reduce Conflicts

Processing the TCS.tcs revealed a fundamental problem that arises in situations related to abstract operatored templates and their transformation to a grammar. Usually, operatored templates are employed when a concrete model element type needs to be created as a compound expression that can be of any subtype of a generic expression type, allowing for operatored expressions - which would otherwise lead to a non-LL left recursion in the produced rules.

Consider Fig. 4.4 as an illustration. The generic binary expression type refers to a left- and a right-hand side of the abstract type Expression. However, BinaryExp is not abstract itself. Assume we have an operator list with only one priority level, containing only one binary left-associative operator op. A mapping with an abstract operatored class template for Expression and an OperatorTemplate for BinaryExp will result (according to the transformations stated in 3.3.2.3 and 3.3.3) in a grammar with a priority 0 rule

    priority_0 ← primary_expression ( op binary_exp primary_expression )*


[Class diagram: abstract class Expression with subclasses IntegerLit (value : Integer) and BinaryExp (opName : String); BinaryExp references Expression via +opLeft and +opRight]

Figure 4.4: Sample metamodel illustrating operatored expressions

and associated actions that push the result of the primary rule onto the stack (for processing by binary_exp, which sets it as the left-hand side of the BinaryExp) and an after-action that sets the result of the last primary expression as its right-hand side (omitted here for better readability). Parsing an input such as 1 op 2 op 3 will yield op(op(1, 2), 3), as desired.
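The way such a repetition builds a left-associative tree can be sketched in plain Java (an illustration only, not the generated parser code; the `parse` helper and its string-based tree representation are hypothetical):

```java
/**
 * Sketch of how the rule  priority_0 <- primary_expression ( op binary_exp primary_expression )*
 * yields a left-associative tree: each iteration of the repetition wraps the
 * result accumulated so far as the new left-hand side.
 */
public class LeftAssoc {
    static String parse(String input) {
        String[] primaries = input.split(" op ");
        String tree = primaries[0];                  // first primary is the initial left-hand side
        for (int i = 1; i < primaries.length; i++)   // each repetition wraps the previous result
            tree = "op(" + tree + ", " + primaries[i] + ")";
        return tree;
    }

    public static void main(String[] args) {
        System.out.println(parse("1 op 2 op 3"));    // op(op(1, 2), 3)
    }
}
```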

As pointed out in the beginning, the syntax mapping for the TCS language itself contains more sophisticated constructs. These lead to rules that are in accordance with the specified transformation, but for which valid inputs cannot be parsed with simple operatored constructs.

template Model::Namespace abstract operatored(DBLCOLON);

template Model::Classifier referenceOnly
    : (isDefined(container) ? container "::" name : name)
    ;

template Model::GeneralizableElement referenceOnly
    : name
    ;

operatorTemplate Model::ModelElement(operators=opDlColon,
        source = container) referenceOnly
    : name
    ;

Listing 4.4: TCS.tcs: concrete syntax for classifiers and namespaces

For references to model elements, TCS contains a referenceOnly template for Classifiers from the M3 model. These references can (and most often will) be fully qualified. The abstract syntax is partially depicted in Fig. 4.5. All depicted model element types are abstract; a Classifier has an optional container of type Namespace and a name attribute.

In concrete syntax, this containment relationship is expressed by a series of package or class names, with double colons separating the namespaces from each other. This gives rise to the syntax shown in Listing 4.4. The conditional element in the Classifier template constitutes the critical part here. In the direction from abstract to concrete syntax its meaning is perfectly clear: "display the container (if present), then a double colon, then the name".


[Class diagram: ModelElement (name : String) with subclasses Namespace, GeneralizableElement, and Classifier; a ModelElement has an optional +container of type Namespace (0..1), which holds an ordered set of +containedElement (0..n)]

Figure 4.5: MOF extract: namespaces and classifiers

However, the other direction is less intuitive. Parsing concrete syntax means answering the question of how the textual parts should be represented as model elements. Therefore, conditional elements are transformed into an alternative with the two choices stated in the then- and the else-clause.

A sample input "PrimitiveTypes::String" should be parsed to a tree as depicted in Fig. 4.6. However, after parsing the first identifier the parser tries to repeat the element on the right. The remaining input can be matched (for one repetition), so rule model_namespace succeeds. But then no input is left for the following double colon and identifier in model_classifier, and that rule throws a parse error.

The discussed issue is an instance of a typical shift/reduce conflict known from LR parsers. However, it cannot be easily detected in a scannerless environment.

Disambiguation with syntactic predicates

Comparison with an ANTLR-generated parser showed that the ANTLR version parses the input correctly because of automatically added syntactic predicates within the repeated element:

    dblcolon_priority_0 ← primary_model_namespace ( "::" model_modelelement &"::" )*

Finding the correct spot for placing a syntactic predicate is not a trivial task. In the above example it is imperative to put the syntactic predicate into the repeated element. Otherwise, the rule always succeeds with too many repetitions and the calling rule fails. This is why the prototypical implementation does not contain any automated injection of predicates but relies on patches to the resulting grammar instead.
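The effect of the &"::" predicate can be illustrated with a small hand-written recursive-descent matcher (a sketch under simplified assumptions; class and method names are made up and the actual generated parser differs):

```java
/** Illustrates how a followed-by predicate (&"::") stops a greedy repetition one
 *  step early, so the calling rule still finds its trailing "::" and name. */
public class PredicateDemo {
    private final String in;
    private int pos;

    PredicateDemo(String in) { this.in = in; }

    private boolean literal(String s) {        // consume s if it is next
        if (in.startsWith(s, pos)) { pos += s.length(); return true; }
        return false;
    }

    private boolean ahead(String s) {          // &s: look ahead without consuming
        return in.startsWith(s, pos);
    }

    private String ident() {                   // [a-zA-Z]+ (assumes one is present)
        int start = pos;
        while (pos < in.length() && Character.isLetter(in.charAt(pos))) pos++;
        return in.substring(start, pos);
    }

    /** namespace <- ident ( "::" ident &"::" )*  -- greedy, but predicate-guarded */
    String namespace() {
        StringBuilder ns = new StringBuilder(ident());
        while (true) {
            int mark = pos;                    // backtracking point
            if (literal("::")) {
                String part = ident();
                if (!part.isEmpty() && ahead("::")) { ns.append("::").append(part); continue; }
            }
            pos = mark;                        // undo the failed repetition
            return ns.toString();
        }
    }

    /** classifier <- namespace "::" ident */
    String classifier() {
        String ns = namespace();
        if (!literal("::")) throw new IllegalStateException("expected ::");
        return "namespace=" + ns + ", name=" + ident();
    }

    public static void main(String[] args) {
        // Without &"::" in namespace(), "String" would be swallowed by the
        // repetition and classifier() would fail on the missing "::".
        System.out.println(new PredicateDemo("PrimitiveTypes::String").classifier());
    }
}
```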

4.4.2 Ordering of Choices

In Rats! grammars, the alternatives on the right-hand side of a production are prioritized, i.e. they represent an ordered choice. This is intended to avoid ambiguities


Figure 4.6: Parse tree for input "PrimitiveTypes::String"

of the grammar. While it does avoid situations in which the parser is uncertain which nonterminal to expand, it does not automatically guarantee the correct (i.e. intended) parse. This is due to the fact that the grammar can contain productions with shadowing alternatives.

Definition 4.1 (Shadowing alternatives). Let Ai and Aj be two alternatives of an ordered choice. Then Ai shadows Aj (Ai ▷ Aj) iff Aj succeeds on some input β ∈ Σ+ and Ai succeeds on a prefix β′ of β.

If Aj is the correct alternative for some input but Ai (shadowing Aj) has a higher priority, then Ai will always be selected, resulting in

• a parse error in the calling rule if β′ is a proper prefix of β (since there will be some input left) or

• an incorrect parse if β′ = β.

Note that two specific situations can be identified that result in shadowing alternatives:

• alternatives for identifiers usually shadow keyword alternatives

• the alternative with an empty sequence (ε-alternative) shadows all possible alternatives. This situation can be detected by the parser generator, however, and will lead to an error while processing the grammar.


This leads to the conclusion that shadowing alternatives should be avoided in order to guarantee the correct parse. Applying the Rats!-generated parser to some valid TCS inputs revealed that such situations actually need special handling. For that, consider the elements ConditionalElement, Expression, PropertyReference, and related ones from the TCS metamodel (found in Fig. 2.7 on page 16). Their concrete syntax is specified by the following mapping:

template TCS::ConditionalElement
    : "(" condition "?" thenSequence (isDefined(elseSequence) ? ":" elseSequence) ")"
    ;

template TCS::Expression abstract;

template TCS::BooleanPropertyExp
    : propertyReference
    ;

template TCS::IsDefinedExp
    : "isDefined" "(" propertyReference ")"
    ;

template TCS::PropertyReference
    : (isDefined(strucfeature) ? strucfeature{refersTo=name, query="OCL:let ...", as = identifierOrKeyword}
      : "->" name{as = identifierOrKeyword})
    ;

Listing 4.5: TCS.tcs: conditionals and expressions

The rule generated from the abstract template for Expression contains alternatives for all subclasses, especially for the listed IsDefinedExp and BooleanPropertyExp. The AsPArg in PropertyReference's template leads to a call to a lexical rule Identifier for all property references (model-creating actions omitted). Thus, since the string literal isDefined can be parsed as an identifier, we have:

    tcs_booleanpropertyexp ▷ tcs_isdefinedexp

leading to a parse error on each occurrence of an isDefined expression.

4.4.3 A Heuristic for Shadowed Alternatives

The typical identifier-vs.-keyword conflict in alternatives produced from a TCS mapping (as illustrated in 4.4.2) can be fixed by revising the ordering of alternatives in a rule. The heuristic takes the following into account:

• Two alternatives starting with different string literals cannot shadow each other.

• A lexical rule alternative can shadow an alternative starting with any non-empty fixed string literal, but not vice versa.


• Two alternatives starting with a lexical rule may be ordered such that longer alternatives precede shorter ones, i.e. in a setting with rules¹

      A ← B / C
      B ← D
      C ← D '(' E ')'
      D ← [a-zA-Z][a-zA-Z0-9_]*

  rule B shadows C, and consequently the ordering of A's alternatives must be: C preceding B.

Another way to solve the issue of shadowing alternatives is via syntactic predicates, similar to the ones inserted as described in Paragraph 4.4.1. Especially in situations where nonterminal A expands to B in almost all cases, the ordering C preceding B would be a performance drawback. But since Rats! factors out the common prefix D of B and C and memoizes intermediate results, this argument can be disregarded.

• The ε-alternative (=empty right-hand side) must be the last of all choices.
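The ordering constraint for lexical-rule alternatives can be demonstrated with a minimal backtracking recognizer for the rules above (an illustrative sketch; D is simplified and E is taken to be another D):

```java
/** Backtracking recognizer for A <- B / C, B <- D, C <- D '(' E ')'
 *  (E simplified to another D). B shadows C: on input "f(x)", B succeeds on the
 *  prefix "f", so trying B first leaves "(x)" unconsumed and the parse fails. */
public class OrderedChoice {
    private final String in;
    private int pos;

    OrderedChoice(String in) { this.in = in; }

    private boolean lit(char c) {
        if (pos < in.length() && in.charAt(pos) == c) { pos++; return true; }
        return false;
    }

    private boolean d() {                       // simplified identifier: [a-zA-Z0-9]+
        int start = pos;
        while (pos < in.length() && Character.isLetterOrDigit(in.charAt(pos))) pos++;
        return pos > start;
    }

    private boolean b() { return d(); }

    private boolean c() {                       // D '(' D ')', with backtracking
        int mark = pos;
        if (d() && lit('(') && d() && lit(')')) return true;
        pos = mark;
        return false;
    }

    /** A with a choice order, followed by end-of-input (simulating the calling rule). */
    boolean a(boolean cFirst) {
        pos = 0;
        boolean ok = cFirst ? (c() || b()) : (b() || c());
        return ok && pos == in.length();
    }

    public static void main(String[] args) {
        OrderedChoice p = new OrderedChoice("f(x)");
        System.out.println(p.a(false));  // B first: false -- the shadowing order fails
        System.out.println(p.a(true));   // C first: true  -- the reordered choice parses
    }
}
```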

4.5 Tokenization and Scannerless Parsing

As pointed out several times before, Rats! does completely without tokens. No lexical analysis is performed prior to syntactic analysis, and parsing is performed on a character-by-character basis. While this has been the central argument for choosing a scannerless technique in the first place, some significant drawbacks result from the absence of tokens, including:

• Difficult error reporting based on expected characters only;

• Limited chances for applying well-studied error recovery techniques such as panic-mode recovery, since they are based on designated synchronizing tokens (see for example [ASU86]).

That is why the implementation outlined here supports token creation as part of the parsing process. It integrates lexing with parsing by emitting tokens during syntactic analysis, just as a DFA would do in a prior lexical analysis step. However, in contrast to a classical scanner², the tokenizing scannerless parser can decide which token type to assign based on contextual information.

Traditional lexers are usually defined as a set of rules associating regular expressions with token types. These can be used to create a combined DFA whose character transitions finally lead to an accepting or error state. Token types are very specific to the grammar definition. Common to most lexers, however, is the notion of characters that separate one token from another. Sec. 4.5.1 details how these can be implemented.

The process of creating tokens on the fly, both conceptually and technically, is orthogonal to the question of what a token is. The solution presented here makes use of Rats! micro transactions to ensure correctness of tokenization even in the presence of backtracking. Details are discussed in Sec. 4.5.3.

¹This could be rewritten as A ← B ( '(' E ')' )?. However, as the grammar is auto-generated, such beautifications are (in general) not possible.

²The terms scanner, tokenizer, and lexical analyzer are used as synonyms throughout this work.


4.5.1 White Space Definition

In traditional compiler construction the definition of white space characters plays a secondary role. In most cases the default definition of what is considered a blank character will suffice to separate tokens. The scannerless paradigm, however, demands special treatment of blank characters, as they may affect the parse.

Parser generator frameworks provide means to specify lexical rules separately from syntactic rules. In ANTLR, for example, lexer rules may be composed from regular expressions. These rules can be identified by a nonterminal starting with a capital letter.

In the Rats! scannerless environment we need to distinguish two types of blank characters: blanks and required blanks.

Definition 4.2 (Blank Character). A character (or sequence of characters) is considered blank if adding it between two tokens of the input does not change the semantics.

Indentation and comments are typical blank sequences, since adding them to a piece of syntax does not change the semantics of the code.

In contrast, not all blank character symbols are dispensable. Some are needed in order to separate two tokens of variable length, e.g. given by a regular expression. This gives rise to the second definition:

Definition 4.3 (Required Blank). A blank character is required if removing it changes the type of the created tokens.

Example 4.1. Consider the code snippet keyword someIdentifier. If the blank is omitted, the lexer cannot recognize the keyword and will lex the whole construct as one identifier. Thus, this occurrence of a blank character is required.

The above distinction between blanks and required blanks is necessary due to some specifics of the Rats! parser generator. While parser generators with a separate lexing phase produce only one token from a keyword and a following identifier without a separating blank, Rats! parses the syntax character-wise and reaches an accepting state as soon as the literal is recognized. This leads to the conclusion that the phrase discussed in Example 4.1 needs an additional separator recognizing at least one blank character.
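The difference can be made concrete with two toy matchers (illustrative only; the short literal "key" stands in for any keyword):

```java
/** Contrasts PEG-style literal matching with longest-match lexing: without a
 *  required blank, the character-wise parser happily splits "keysomeIdentifier"
 *  into keyword + identifier, while a classical lexer produces one identifier. */
public class RequiredBlank {
    /** Scannerless view: literal "key" followed directly by an identifier. */
    static boolean literalThenIdent(String in) {
        if (!in.startsWith("key")) return false;
        return in.length() > 3;        // any trailing letters are taken as the identifier
    }

    /** Lexer view: longest match wins, then keywords are recognized among identifiers. */
    static String firstToken(String in) {
        int i = 0;
        while (i < in.length() && Character.isLetterOrDigit(in.charAt(i))) i++;
        String lexeme = in.substring(0, i);
        return lexeme.equals("key") ? "KEY" : "IDENT(" + lexeme + ")";
    }

    public static void main(String[] args) {
        System.out.println(literalThenIdent("keysomeIdentifier")); // true: parser splits without a blank
        System.out.println(firstToken("keysomeIdentifier"));       // IDENT(keysomeIdentifier)
        System.out.println(firstToken("key someIdentifier"));      // KEY
    }
}
```

Hence the grammar generated from the mapping must force at least one blank character after such a literal.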

Fig. 4.7 shows the automaton representing the lexer with forced recognition of a white space between keyword and identifier.

[DFA with states 1-4: state 1 -key-> 2, 2 -ws-> 3, 3 -ws-> 3 (self-loop), 3 -id-> 4]

Figure 4.7: DFA recognizing keyword followed by identifier

In the generated Rats! grammar, blanks and required blanks are recognized by two different default rules:


• Spacing for blanks, depending on the rules WS and COMMENT, which represent a single white space character and a comment, respectively;

• FSpacing for required blanks, defined analogously; its multiplicity is "+" instead of "*".

The rules WS and COMMENT are subject to change by the author of the TCS mapping. E.g., the syntax for comments can be specified in the respective section of the TCS document:

token COMMENT : endOfLine(start = "--");

If no user-defined mappings for comments and white spaces are found, the framework creates default rules.

4.5.2 Assignment of Token Types

For several reasons (including more convenient error reporting and recovery), tokens are helpful in the process of parsing a textual artifact. That is why the outlined implementation employs a strategy to create tokens during syntactic analysis. Since the parsing process is backtracking, all created tokens are subject to deletion upon abort of a rule. This can be handled by having the transactional actions start(), commit() and abort() execute any token creation or deletion.

Definition 4.4 (Token). A token is a 4-tuple T = (tt, s, e, val) where tt is the token type, s and e are start and end indices, and val is an (optional) token value.
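Definition 4.4 maps directly onto a small Java value type (a sketch; the class in the actual implementation may differ):

```java
/** A token as a 4-tuple (tt, s, e, val): token type, start and end indices,
 *  and an optional value (Definition 4.4). */
public record Token(String type, int start, int end, String value) {
    public Token {
        if (end < start) throw new IllegalArgumentException("end < start");
    }
    /** Number of characters covered by the token's lexeme. */
    public int length() { return end - start; }
}
```

For example, `new Token("IDENT", 0, 14, "PrimitiveTypes")` would represent the first lexeme of the running example.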

If not stated otherwise, the term token refers to this definition. In contrast, occurrences of tokens in a textual artifact are referred to as lexemes. The central task of tokenization is thus assigning tokens to lexemes and setting the token attributes, which is usually done in the scanner phase.

The main benefit of deferring this assignment to the syntactic analysis phase, integrating the lexer with the parser, is as follows: when creating token instances, the parser can use information from the parse, i.e. the expected type of construct.

4.5.3 Tokenizing via Lightweight Transactions

Technically, the implementation of an integrated lexer/parser architecture must take into account aborts of rules while token instances are created, because all Rats!-generated parsers are backtracking and employ memoization for better performance. The micro transactions discussed in Sec. 4.3 can observe the parsing process and emit or retract the appropriate tokens.

Algorithm 4.1 (Tokenization via micro transactions).

curTok := (0, int_max);
stack := [ ];
tokens := [ ];
while !EOF do
    if stateful then start(); fi
    success := call appropriateParseRule(yyStart);
    if lexical then call newToken(yyStart, ttype); fi
    if stateful then
        if success then call commit(); else call abort(); fi
    fi
od

where

proc start() ≡
    stack.push(tokens);
    tokens := [ ];

proc commit() ≡
    added := stack.peek();
    added.addAll(tokens);
    stack.pop();
    tokens := added;

proc abort() ≡
    tokens := stack.pop();

proc newToken(start, type) ≡
    curTok := new Token(start, yyCount, type);
    tokens := tokens ∪ {curTok};

This is realized via the algorithm listed in Algorithm 4.1. The central observations are:

• Only lexical rules emit tokens.

• For each nested transaction a list of tokens is established, holding all tokens created after the start of the transaction.

• A stack contains elements for each transaction level. After the last commit operation, tokens contains all tokens created so far, i.e. all tokens when the end of file is reached with a valid parse.
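The three transactional actions can be sketched in Java as a stack of token lists (an illustration of the algorithm's bookkeeping only, not the actual implementation; tokens are plain strings here):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

/** Sketch of Algorithm 4.1's transactional token handling: a stack of token
 *  lists mirrors nested parser transactions, so tokens emitted inside an
 *  aborted rule are retracted, and committed tokens survive. */
public class TokenTransactions {
    private final Deque<List<String>> stack = new ArrayDeque<>();
    private List<String> tokens = new ArrayList<>();

    public void start() {                  // enter a stateful rule
        stack.push(tokens);
        tokens = new ArrayList<>();
    }
    public void commit() {                 // rule succeeded: merge into enclosing level
        List<String> outer = stack.pop();
        outer.addAll(tokens);
        tokens = outer;
    }
    public void abort() {                  // rule failed: drop tokens created since start()
        tokens = stack.pop();
    }
    public void newToken(String tok) {     // emitted by a lexical rule
        tokens.add(tok);
    }
    public List<String> tokens() { return tokens; }
}
```

A token emitted inside a rule that later aborts is thereby retracted, while committed tokens are merged into the enclosing transaction level.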

Memoization

A critical point when intercepting parsing methods is memoization: for better performance, Rats! memoizes intermediate results. Instead of calling a method again whose result was memoized, Rats! simply accesses the memoization table and returns the semantic value associated with the parse. In a way, this violates the general assumption that the execution of all stateful rules can be intercepted by the transactional methods.

For the process of tokenization, this inconvenient fact requires additional storage of tokens created in a transaction that will be aborted. A java.util.HashMap mapping indices to tokens is used to guarantee that no gaps exist in the output list of tokens. After finishing the parse of the entire artifact, a sanity check on the emitted tokens is carried out according to Algorithm 4.2.


With the presented algorithms, tokens can be created effectively on the fly. Generally, this applies to parsing of a complete artifact only. However, the algorithms can be modified towards incrementality of the generated parsers, too. Such changes include adjustments to the token indices under consideration as well as a method to locally update the set of newly created tokens without invalidating all tokens created so far.

Algorithm 4.2 (Sanity check for emitted tokens).

ret_tokens := [ ];
idx := −1;
hasAdded := false;
forall tok ∈ tokens do
    if !hasAdded ∨ idx ≥ tok.lower then
        tok := tokens.next();
        hasAdded := false;
    fi
    if tok.lower = idx + 1 then
        idx := tok.upper;
        ret_tokens := ret_tokens ∪ {tok};
        continue;
    else
        hasAdded := true;
        tok′ := lookup(idx + 1);
        idx := tok′.upper;
        ret_tokens := ret_tokens ∪ {tok′};
        continue;
    fi
od
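The gap-filling idea behind the sanity check can be sketched as follows (a simplified reading of Algorithm 4.2; tokens are represented as {lower, upper} index pairs, and `memoized` stands for the HashMap of additionally stored tokens from aborted transactions):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

/** Simplified reading of Algorithm 4.2: walk the emitted tokens in order and,
 *  whenever a gap opens between the last covered index and the next token's
 *  lower bound, fill it with a token from the additionally stored map
 *  (assumed to contain an entry for every gap start). */
public class SanityCheck {
    static List<int[]> fillGaps(List<int[]> tokens, Map<Integer, int[]> memoized) {
        List<int[]> result = new ArrayList<>();
        int idx = -1;                          // last covered input index
        for (int[] tok : tokens) {
            while (tok[0] > idx + 1) {         // gap: pull the missing token by its start index
                int[] fill = memoized.get(idx + 1);
                result.add(fill);
                idx = fill[1];
            }
            result.add(tok);
            idx = tok[1];
        }
        return result;
    }
}
```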

4.6 Challenges

As a prototypical implementation, the proposed grammar generation claims to be neither stable nor complete. Remaining open issues include the automatic injection of syntactic predicates instead of the grammar patches detailed in 4.4.1.

As part of an editing framework, the generated model-injecting parsers need to be integrated into the editor environment. For integration into a textual editing framework, parsers generated for domain-specific languages need to support three critical features: incrementality, error handling, and error recovery. These are discussed in the following.

4.6.1 Incrementality

As a highly user-interactive process, textual editing requires means to keep the latency resulting from parsing, model resolution etc. as low as possible. Completely parsing a large document with a fair amount of model updates or creations still needs seconds to finish. This is unacceptable for an interactive environment. For that


reason, only incremental changes to the textual artifact must be re-lexed and re-parsed. The top-down recursive descent parsing technique Rats! relies on is considered to support incremental re-parsing of text regions, similar to parsers generated by the ANTLR framework.

Lexing, or tokenization (because it is performed here during syntactic analysis), is regarded as a more complex task. The incremental lexing algorithm currently implemented (similar to [CW01]) is not applicable to the scannerless parsing technique that Rats! parsers apply. Since that algorithm assumes a strict separation of the two components, token handling requires a full redesign.

When tokens are emitted on the fly, a set-based or list-based storage mechanism for tokens may be more appropriate than the typically used stream-based data structure. The TextBlock-based approach presented in [GBU09] must be adapted to the new tokenizer. Some parts of the incremental lexical analyzer are no longer necessary with the scannerless approach. This includes the calculation of lookback indexes from lookahead values originating from the deterministic finite automaton (lexer).

The granularity of re-parses must be considered. There is a one-to-one relationship between rules in Rats! grammars and generated parser methods. For incremental parsing, these methods are invoked with the designated partial input of the textual artifact. The textblock decorator approach was designed to minimize the work needed when some textblocks are changed. The question of what the minimal impact is must be answered for the scannerless parsing technique.

4.6.2 Error Handling

Syntactic errors in the textual artifact should be reported to the user with a proper indication of the region that could not be parsed. Tokens are the syntactic units that are preferably proposed as expected constructs. Suggestions of which constructs are expected are favored over reports of constructs that failed to parse. Semantic errors communicated to the user must contain meaningful information about which elements could not be resolved.

4.6.3 Error Recovery

Especially for editing environments, error recovery is an important feature. During editing of a textual artifact, the code will contain errors most of the time. Powerful development environments therefore provide recovery for the most frequent syntactic errors: duplicate tokens and missing tokens. Panic-mode error recovery (see [ASU86] for details) is a feature most likely requested by a textual modeling framework. How this can be consolidated with the scannerless approach should be investigated.

4.6.4 From Embedding to Composition

As pointed out in the introduction (Sec. 1.3), some questions must be answered when languages are composed rather than embedded. The existence of two separate language toolkits that are combined by an editor is not sufficient. A clear and flexible design for the composition of scopes and symbol tables must be devised in order to allow cross-referencing and expressive language composites. We envision an interface-based mechanism that creates a new toolkit on top of the existing languages to be composed, maintaining the fundamental demand for side-effect-free reusability.


5. Summary and Conclusions

We investigated how scannerless parsing can solve problems involved in the composition of languages. While integrating lexical and syntactic analysis entails the fundamental advantage that lexical conflicts can be solved with the help of a construct's context, challenges for error handling result from the absence of tokens. Applying a backtracking parsing strategy instead of a predictive one has proven to be manageable in this context.

Feasibility

By designing and implementing the transformation from a concrete-to-abstract syntax mapping and the specified language metamodel, we showed that model-injecting parsers can be automatically derived and lead to valid domain parsers, notwithstanding details involving scopes and repetitions.

Error reporting and recovery techniques are most likely a more complex task. This is due to the fact that when tokens are created during the syntactic analysis phase, they are not available for erroneous code fragments. The absence of tokens is the most critical issue when handling errors.

A two-pass strategy is not expected to succeed either: creating tokens beforehand without using them for syntactic analysis is probably too expensive a task and impedes performance unnecessarily.

Backtracking of the generated parsers is, in principle, not considered a critical issue for composition. We showed, by using lightweight nested transactions in the implementation (Sec. 4.3), that the parsing process can be observed sufficiently.

We therefore recommend continuing the work on scannerless parsing techniques in the context of concrete textual syntax for models.


Bibliography

[ASU86] Alfred V. Aho, Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. Addison-Wesley Longman Publishing Co., Inc., Boston, MA, USA, 1986.

[AU72] Alfred V. Aho and Jeffrey D. Ullman. The Theory of Parsing, Translation, and Compiling, volume I: Parsing of Series in Automatic Computation. Prentice Hall, Englewood Cliffs, New Jersey, 1972.

[BCE+] Heiko Behrens, Michael Clay, Sven Efftinge, Moritz Eysholdt, Peter Friese, Jan Köhnlein, Knut Wannheden, and Sebastian Zarnekow. Xtext User Guide, version 0.7.

[Bra08] Martin Bravenboer. Exercises in Free Syntax. Syntax Definition, Parsing, and Assimilation of Language Conglomerates. PhD thesis, Utrecht University, Utrecht, The Netherlands, January 2008.

[BV04] Martin Bravenboer and Eelco Visser. Concrete syntax for objects: domain-specific language embedding and assimilation without restrictions. In OOPSLA '04: Proceedings of the 19th annual ACM SIGPLAN conference on Object-oriented programming, systems, languages, and applications, pages 365–383, New York, NY, USA, 2004. ACM.

[CW01] Phil Cook and Jim Welsh. Incremental parsing in language-based editors: user needs and how to meet them. Softw. Pract. Exper., 31(15):1461–1486, 2001.

[For02] Bryan Ford. Packrat parsing: simple, powerful, lazy, linear time, functional pearl. In ICFP '02: Proceedings of the seventh ACM SIGPLAN international conference on Functional programming, pages 36–47, New York, NY, USA, 2002. ACM.

[For04] Bryan Ford. Parsing expression grammars: a recognition-based syntactic foundation. In POPL '04: Proceedings of the 31st ACM SIGPLAN-SIGACT symposium on Principles of programming languages, pages 111–122, New York, NY, USA, 2004. ACM.

[GBU08] Thomas Goldschmidt, Steffen Becker, and Axel Uhl. Classification of concrete textual syntax mapping approaches. In ECMDA-FA '08: Proceedings of the 4th European conference on Model Driven Architecture, pages 169–184, Berlin, Heidelberg, 2008. Springer-Verlag.


[GBU09] Thomas Goldschmidt, Steffen Becker, and Axel Uhl. Textual views in model driven engineering. In Proceedings of the 35th EUROMICRO Conference on Software Engineering and Advanced Applications (SEAA). IEEE, 2009.

[GKR+07] Hans Grönniger, Holger Krahn, Bernhard Rumpe, Martin Schindler, and Steven Völkel. Textual modeling. In Proceedings of the 4th International Workshop on Language Engineering (ATEM 2007), 2007.

[Gol09] Thomas Goldschmidt. Towards an incremental update approach for concrete textual syntaxes for UUID-based model repositories. pages 168–177, 2009.

[Gri06] Robert Grimm. Better extensibility through modular syntax. In PLDI '06: Proceedings of the 2006 ACM SIGPLAN conference on Programming language design and implementation, pages 38–51, New York, NY, USA, 2006. ACM.

[HHJ+08] Jakob Henriksson, Florian Heidenreich, Jendrik Johannes, Steffen Zschaler, and Uwe Aßmann. Extending grammars and metamodels for reuse: the Reuseware approach. IET Software, 2(3):165–184, 2008.

[JBK06] Frédéric Jouault, Jean Bézivin, and Ivan Kurtev. TCS: a DSL for the specification of textual concrete syntaxes in model engineering. In GPCE '06: Proceedings of the 5th international conference on Generative programming and component engineering, pages 249–254, New York, NY, USA, 2006. ACM.

[KKV08] Lennart C. L. Kats, Karl Trygve Kalleberg, and Eelco Visser. Generating editors for embedded languages: integrating SGLR into IMP. In A. Johnstone and J. Vinju, editors, Proceedings of the Eighth Workshop on Language Descriptions, Tools, and Applications (LDTA 2008), Budapest, Hungary, April 2008.

[KRV07] Holger Krahn, Bernhard Rumpe, and Steven Völkel. Efficient editor generation for compositional DSLs in Eclipse. In A. Johnstone and J. Vinju, editors, Proceedings of the 7th OOPSLA Workshop on Domain-Specific Modeling (DSM '07), Montreal, Canada, October 2007.

[Kus08] Martin Küster. EMF implementation of a grammar based transformation framework for source code analysis. Studienarbeit, Universität Karlsruhe, 2008.

[Li95] Warren X. Li. A simple and efficient incremental LL(1) parsing. In SOFSEM '95: Proceedings of the 22nd Seminar on Current Trends in Theory and Practice of Informatics, pages 399–404, London, UK, 1995. Springer-Verlag.

[RS83] V. J. Rayward-Smith. A First Course in Formal Language Theory. Blackwell Scientific Publications, Ltd., Oxford, UK, 1983.


[SC89] Daniel J. Salomon and Gordon V. Cormack. Corrections to the paper: Scannerless NSLR(1) parsing of programming languages. SIGPLAN Notices, 24(11):80–83, 1989.

[Shi93] John J. Shilling. Incremental LL(1) parsing in language-based editors. IEEE Trans. Software Eng., 19(9):935–940, 1993.

[Tom87] Masaru Tomita. An efficient augmented-context-free parsing algorithm. Computational Linguistics, 12(1-2):31–46, 1987.

[vdBSVV02] M. G. J. van den Brand, J. Scheerder, J. J. Vinju, and E. Visser. Disambiguation filters for scannerless generalized LR parsers. In Compiler Construction (CC '02), pages 143–158. Springer-Verlag, 2002.

[vdBvDH+01] Mark van den Brand, Arie van Deursen, Jan Heering, H. A. de Jong, Merijn de Jonge, Tobias Kuipers, Paul Klint, Leon Moonen, Pieter A. Olivier, Jeroen Scheerder, Jurgen J. Vinju, Eelco Visser, and Joost Visser. The ASF+SDF Meta-Environment: a component-based language development environment. In CC, pages 365–370, 2001.

[Vis97] Eelco Visser. Syntax Definition for Language Prototyping. PhD thesis, Faculteit Wiskunde, Informatica, Natuurkunde en Sterrenkunde, Universiteit van Amsterdam, 1997.

[Wag98] Tim A. Wagner. Practical algorithms for incremental software development environments. Technical report, Berkeley, CA, USA, 1998.
