Using JavaCC

Professor Yihjia Tsai, Tamkang University


Post on 30-Dec-2015


Page 1: Using JavaCC

Using JavaCC

Professor Yihjia Tsai

Tamkang University

Page 2: Using JavaCC

Automating Lexical Analysis: Overall picture

[Figure: the scanner generator, with stages labeled RE, NFA, DFA, Minimize DFA, and Simulate DFA, produces a Java scanner program; the scanner program turns a string stream into tokens.]

Page 3: Using JavaCC

Building Faster Scanners from the DFA

Table-driven recognizers waste a lot of effort
• Read (& classify) the next character
• Find the next state
• Assign to the state variable
• Branch back to the top

We can do better
• Encode state & actions in the code
• Do transition tests locally
• Generate ugly, spaghetti-like code (it is OK, this is automatically generated code)
• Takes (many) fewer operations per input character

Table-driven recognizer:

state = s0;
string = ε;
char = get_next_char();
while (char != eof) {
    state = δ(state, char);
    string = string + char;
    char = get_next_char();
}
if (state in Final) then report acceptance;
else report failure;
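The contrast between the two styles can be sketched in Java. This is a minimal illustration, not code generated by JavaCC: the class name `DigitScanners`, the three-state table, and the choice of the language "one or more digits" are all assumptions made for the example.

```java
class DigitScanners {
    // Character classes (columns of the transition table).
    static final int DIGIT = 0, OTHER = 1;

    // States: 0 = start, 1 = seen at least one digit (final), 2 = error.
    static final int[][] DELTA = {
        {1, 2},   // from state 0
        {1, 2},   // from state 1
        {2, 2},   // from state 2 (error sink)
    };

    static int classify(char c) {
        return (c >= '0' && c <= '9') ? DIGIT : OTHER;
    }

    /** Table-driven: classify the character, look up the next state, assign, loop. */
    static boolean tableDriven(String input) {
        int state = 0;
        for (int i = 0; i < input.length(); i++) {
            state = DELTA[state][classify(input.charAt(i))];
        }
        return state == 1;
    }

    /** Direct-coded: each state is a piece of code that does its own transition tests. */
    static boolean directCoded(String input) {
        int i = 0;
        // State s0: the first character must be a digit.
        if (i >= input.length() || input.charAt(i) < '0' || input.charAt(i) > '9') {
            return false;
        }
        i++;
        // State s1: stay here while digits keep coming.
        while (i < input.length()) {
            char c = input.charAt(i);
            if (c < '0' || c > '9') {
                return false;
            }
            i++;
        }
        return true;
    }

    public static void main(String[] args) {
        for (String s : new String[] {"2024", "", "12a"}) {
            System.out.println("\"" + s + "\" table=" + tableDriven(s)
                    + " direct=" + directCoded(s));
        }
    }
}
```

Both methods accept the same language; the direct-coded version avoids the table lookup and the state variable, at the cost of code that grows with the number of states.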

Page 4: Using JavaCC

Inside lexical analyzer generator

• How does a lexical analyzer generator work?
– Get input from the user, who defines tokens in a form that is equivalent to a regular grammar
– Turn the regular grammar into an NFA
– Convert the NFA into a DFA
– Generate the code that simulates the DFA
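The NFA-to-DFA step is usually done with the subset construction. The sketch below is an assumption-laden illustration rather than what JavaCC does internally: the class `SubsetConstruction`, its helper names, and the hand-built NFA for (a|b)*abb are all invented for the example. It requires Java 9 or later for `Map.of` and `Set.of`.

```java
import java.util.*;

class SubsetConstruction {
    static final char EPS = '\0';   // marker used for epsilon edges
    final Map<Integer, Map<Character, Set<Integer>>> nfa = new HashMap<>();
    int start = 0;
    final Set<Integer> finals = new TreeSet<>();

    void edge(int from, char sym, int to) {
        nfa.computeIfAbsent(from, k -> new HashMap<>())
           .computeIfAbsent(sym, k -> new TreeSet<>())
           .add(to);
    }

    Set<Integer> targets(int state, char sym) {
        return nfa.getOrDefault(state, Map.of()).getOrDefault(sym, Set.of());
    }

    /** All NFA states reachable from the given states through epsilon edges only. */
    Set<Integer> closure(Set<Integer> states) {
        Set<Integer> result = new TreeSet<>(states);
        Deque<Integer> work = new ArrayDeque<>(states);
        while (!work.isEmpty()) {
            int s = work.pop();
            for (int t : targets(s, EPS)) {
                if (result.add(t)) work.push(t);
            }
        }
        return result;
    }

    Set<Integer> move(Set<Integer> states, char sym) {
        Set<Integer> out = new TreeSet<>();
        for (int s : states) out.addAll(targets(s, sym));
        return out;
    }

    /** Subset construction: every DFA state is a set of NFA states. */
    Map<Set<Integer>, Map<Character, Set<Integer>>> buildDfa(String alphabet) {
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = new LinkedHashMap<>();
        Deque<Set<Integer>> work = new ArrayDeque<>();
        Set<Integer> first = closure(Set.of(start));
        dfa.put(first, new HashMap<>());
        work.push(first);
        while (!work.isEmpty()) {
            Set<Integer> cur = work.pop();
            for (char c : alphabet.toCharArray()) {
                Set<Integer> next = closure(move(cur, c));
                if (next.isEmpty()) continue;          // no transition on c
                dfa.get(cur).put(c, next);
                if (!dfa.containsKey(next)) {
                    dfa.put(next, new HashMap<>());
                    work.push(next);
                }
            }
        }
        return dfa;
    }

    /** Simulates the DFA; accepts if the final DFA state contains an NFA final state. */
    boolean accepts(Map<Set<Integer>, Map<Character, Set<Integer>>> dfa, String input) {
        Set<Integer> state = closure(Set.of(start));
        for (char c : input.toCharArray()) {
            state = dfa.getOrDefault(state, Map.of()).get(c);
            if (state == null) return false;
        }
        for (int f : finals) if (state.contains(f)) return true;
        return false;
    }

    /** Hand-built NFA for (a|b)*abb with one epsilon edge from the start state. */
    static SubsetConstruction example() {
        SubsetConstruction n = new SubsetConstruction();
        n.finals.add(4);
        n.edge(0, EPS, 1);
        n.edge(1, 'a', 1); n.edge(1, 'b', 1);
        n.edge(1, 'a', 2); n.edge(2, 'b', 3); n.edge(3, 'b', 4);
        return n;
    }

    public static void main(String[] args) {
        SubsetConstruction n = example();
        Map<Set<Integer>, Map<Character, Set<Integer>>> dfa = n.buildDfa("ab");
        System.out.println("DFA states: " + dfa.keySet());
        System.out.println("abb accepted: " + n.accepts(dfa, "abb"));
    }
}
```

A real generator would go on to minimize the DFA and emit code for it; this sketch stops at simulation.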

Page 5: Using JavaCC

Flow for Using JavaCC

Extracted from http://www.cs.unb.ca/profs/nickerson/courses/cs4905/Labs/L1_2006.pdf

Page 6: Using JavaCC

Structure of a JavaCC File

• A JavaCC file is composed of 3 portions:
– Options
– Class declaration
– Specification for lexical analysis (tokens), and specification for syntax analysis.
• For the very first example of JavaCC, let's recognize two tokens: "+" and numerals.
• Use an editor to edit and save it with file name numeral.jj

Focus of this Lecture

Page 7: Using JavaCC

Using JavaCC for lexical analysis

• JavaCC is a “top-down” parser generator.
• Some parser generators (such as yacc, bison, and JavaCUP) need a separate lexical-analyzer generator.
• With JavaCC, you can specify the tokens within the parser generator.

Page 8: Using JavaCC

Example File

/* main class definition */
PARSER_BEGIN(Numeral)
public class Numeral {
    public static void main(String[] args) throws ParseException, TokenMgrError {
        Numeral numeral = new Numeral(System.in);
        while (numeral.getNextToken().kind != EOF);
    }
}
PARSER_END(Numeral)

/* token definitions */
TOKEN:
{
    <ADD: "+">
  | <NUMERAL: (["0"-"9"])+>
}

Page 9: Using JavaCC

Options

• The options portion is optional and is omitted in the previous example.
• STATIC is a boolean option whose default value is true. If true, all methods and class variables are specified as static in the generated parser and token manager.
– This allows only one parser object to be present, but it improves the performance of the parser.
– To perform multiple parses during one run of your Java program, you will have to call the ReInit() method to reinitialize your parser if it is static.
– If the parser is non-static, you may use the "new" operator to construct as many parsers as you wish. These can all be used simultaneously from different threads.
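The trade-off between static and non-static parsers can be mimicked with a toy Java class. This is only an analogy, assuming nothing about the code JavaCC actually generates: `StaticStateDemo`, `StaticScanner`, `InstanceScanner`, `reInit`, and `consumeAll` are hypothetical names invented here.

```java
class StaticStateDemo {
    /** All state is static, so there is effectively one scanner per program. */
    static class StaticScanner {
        private static String input = "";
        private static int pos = 0;

        /** Hypothetical analogue of ReInit(): reset the shared state for a new input. */
        static void reInit(String newInput) {
            input = newInput;
            pos = 0;
        }

        /** Consumes the rest of the input and returns how many characters were read. */
        static int consumeAll() {
            int n = 0;
            while (pos < input.length()) { pos++; n++; }
            return n;
        }
    }

    /** State lives in the object, so many scanners can coexist. */
    static class InstanceScanner {
        private final String input;
        private int pos = 0;

        InstanceScanner(String input) { this.input = input; }

        int consumeAll() {
            int n = 0;
            while (pos < input.length()) { pos++; n++; }
            return n;
        }
    }

    public static void main(String[] args) {
        StaticScanner.reInit("12+3");
        System.out.println(StaticScanner.consumeAll());
        StaticScanner.reInit("7");            // must reset before the second run
        System.out.println(StaticScanner.consumeAll());

        InstanceScanner a = new InstanceScanner("12+3");
        InstanceScanner b = new InstanceScanner("7");
        System.out.println(a.consumeAll() + " " + b.consumeAll());
    }
}
```

The static variant needs an explicit reset between inputs because its position is shared; the instance variant keeps each input's position separate.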

Page 10: Using JavaCC

Start

/* main class definition */
PARSER_BEGIN(Numeral)
public class Numeral {
    public static void main(String[] args) throws ParseException, TokenMgrError {
        Numeral numeral = new Numeral(System.in);
        while (numeral.getNextToken().kind != EOF);
    }
}
PARSER_END(Numeral)

/* token definitions */
TOKEN:
{
    <ADD: "+">
  | <NUMERAL: (["0"-"9"])+>
}

Simple Loop Getting Tokens

Page 11: Using JavaCC

Compilation

After calling javacc to compile numeral.jj, seven files are generated if no error messages occur.

They are Numeral.java, NumeralConstants.java, NumeralTokenManager.java, ParseException.java, SimpleCharStream.java, Token.java, and TokenMgrError.java.

bash-2.05$ javacc numeral.jj

Java Compiler Compiler Version 3.2 (Parser Generator)

(type "javacc" with no arguments for help)

Reading from file numeral.jj . . .

File "TokenMgrError.java" does not exist. Will create one.

File "ParseException.java" does not exist. Will create one.

File "Token.java" does not exist. Will create one.

File "SimpleCharStream.java" does not exist. Will create one.

Parser generated successfully

Page 12: Using JavaCC

JavaCC specification of a lexer

[Figure: lexer specification, with callouts "Note the need for ( )!" and "Defining Whitespace"]

Page 13: Using JavaCC

A Full Example

See the sample file

Page 14: Using JavaCC

Dealing with errors

• Error reporting: 123e+q
• Could consider it an invalid token (lexical error) or
• return a sequence of valid tokens
– 123, e, +, q,
– and let the parser deal with the error.
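The second option, returning a sequence of valid tokens, can be sketched with a small hand-written scanner. The token kinds NUM, ID, and ADD and the class `ErrorTokenDemo` are assumptions for this example; the slides do not define them.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ErrorTokenDemo {
    // One alternative per token kind; named groups tell us which one matched.
    static final Pattern TOKEN =
            Pattern.compile("(?<NUM>[0-9]+)|(?<ID>[A-Za-z]+)|(?<ADD>\\+)");

    /** Splits the input into "KIND:lexeme" strings, or throws on an unknown character. */
    static List<String> tokenize(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        int pos = 0;
        while (pos < input.length()) {
            m.region(pos, input.length());
            if (!m.lookingAt()) {
                // Option 1 from the slide: report a lexical error.
                throw new IllegalArgumentException("lexical error at offset " + pos);
            }
            String kind = m.group("NUM") != null ? "NUM"
                        : m.group("ID") != null ? "ID" : "ADD";
            out.add(kind + ":" + m.group());
            pos = m.end();
        }
        return out;
    }

    public static void main(String[] args) {
        // The scanner is happy with 123e+q; it is the parser that must reject it.
        System.out.println(tokenize("123e+q"));
    }
}
```

With these token definitions the input 123e+q is lexically valid, so any complaint about it has to come from the parser.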

Page 15: Using JavaCC

Lexical error correction?

• Sometimes interaction between the scanner and parser can help
– especially in a top-down (predictive) parse
– The parser, when it calls the scanner, can pass as an argument the set of allowable tokens.
– Suppose the scanner sees calss in a context where only a top-level definition is allowed.
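One way such cooperation could work is for the scanner to pick the allowed keyword closest to the misspelled lexeme. The sketch below assumes edit distance as the similarity measure and a caller-supplied threshold; `ContextualScanner`, `nextToken`, and the keyword set are hypothetical, and nothing here is part of JavaCC.

```java
import java.util.Set;

class ContextualScanner {
    /** Classic dynamic-programming Levenshtein distance. */
    static int editDistance(String a, String b) {
        int[][] d = new int[a.length() + 1][b.length() + 1];
        for (int i = 0; i <= a.length(); i++) d[i][0] = i;
        for (int j = 0; j <= b.length(); j++) d[0][j] = j;
        for (int i = 1; i <= a.length(); i++) {
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                d[i][j] = Math.min(Math.min(d[i - 1][j] + 1, d[i][j - 1] + 1),
                                   d[i - 1][j - 1] + cost);
            }
        }
        return d[a.length()][b.length()];
    }

    /**
     * Returns the lexeme itself if it is allowed, otherwise the closest allowed
     * keyword within maxDistance edits, otherwise null (no plausible repair).
     */
    static String nextToken(String lexeme, Set<String> allowed, int maxDistance) {
        if (allowed.contains(lexeme)) return lexeme;
        String best = null;
        int bestDist = maxDistance + 1;
        for (String keyword : allowed) {
            int dist = editDistance(lexeme, keyword);
            if (dist < bestDist) {
                best = keyword;
                bestDist = dist;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // The parser passes the tokens that may start a top-level definition.
        Set<String> topLevel = Set.of("class", "interface", "enum");
        System.out.println(nextToken("calss", topLevel, 2));
    }
}
```

Passing the allowed set keeps the repair local: the same misspelling in a context where identifiers are legal would simply be returned as an identifier.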

Page 16: Using JavaCC

Same symbol, different meaning.

• How can the scanner distinguish between binary minus and unary minus?
– x = -a; vs x = 3 - a;
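One common heuristic, sketched here as an assumption rather than as the slide's answer, is to look at the previous token: a minus that follows an operand (number, identifier, or closing parenthesis) is binary, anything else is unary. Many scanners instead return a single MINUS token and leave the decision to the parser. The class `MinusClassifier` and its token kinds are invented for the example.

```java
import java.util.ArrayList;
import java.util.List;

class MinusClassifier {
    enum Kind { NUMBER, IDENT, ASSIGN, LPAREN, RPAREN, SEMI, UNARY_MINUS, BINARY_MINUS }

    static List<Kind> scan(String input) {
        List<Kind> out = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            char c = input.charAt(i);
            if (Character.isWhitespace(c)) {
                i++;
            } else if (Character.isDigit(c)) {
                while (i < input.length() && Character.isDigit(input.charAt(i))) i++;
                out.add(Kind.NUMBER);
            } else if (Character.isLetter(c)) {
                while (i < input.length() && Character.isLetterOrDigit(input.charAt(i))) i++;
                out.add(Kind.IDENT);
            } else {
                // The previous token decides how a '-' is classified.
                Kind prev = out.isEmpty() ? null : out.get(out.size() - 1);
                boolean afterOperand =
                        prev == Kind.NUMBER || prev == Kind.IDENT || prev == Kind.RPAREN;
                switch (c) {
                    case '=': out.add(Kind.ASSIGN); break;
                    case '(': out.add(Kind.LPAREN); break;
                    case ')': out.add(Kind.RPAREN); break;
                    case ';': out.add(Kind.SEMI); break;
                    case '-':
                        out.add(afterOperand ? Kind.BINARY_MINUS : Kind.UNARY_MINUS);
                        break;
                    default:
                        throw new IllegalArgumentException(
                                "unexpected character '" + c + "' at offset " + i);
                }
                i++;
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(scan("x = -a;"));
        System.out.println(scan("x = 3 - a;"));
    }
}
```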

Page 17: Using JavaCC

Scanner “troublemakers”

• Unclosed strings
• Unclosed comments.
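Both troublemakers are easiest to report well if the scanner remembers where the construct started. The following is a minimal hand-written sketch; the class `UnclosedDetector`, the C-style `/* */` comment syntax, and the rule that a string may not span a newline are assumptions for the example.

```java
class UnclosedDetector {
    /**
     * Scans a string literal whose opening quote is at src.charAt(start).
     * Returns the index just past the closing quote, or throws if the
     * literal runs into a newline or the end of input.
     */
    static int scanString(String src, int start) {
        for (int i = start + 1; i < src.length(); i++) {
            char c = src.charAt(i);
            if (c == '\\') { i++; continue; }   // skip the escaped character
            if (c == '"') return i + 1;
            if (c == '\n') break;
        }
        throw new IllegalStateException("unclosed string starting at offset " + start);
    }

    /**
     * Scans a block comment that begins with the two characters at start.
     * Returns the index just past the closing delimiter, or throws.
     */
    static int scanBlockComment(String src, int start) {
        int end = src.indexOf("*/", start + 2);
        if (end < 0) {
            throw new IllegalStateException("unclosed comment starting at offset " + start);
        }
        return end + 2;
    }

    public static void main(String[] args) {
        System.out.println(scanString("\"abc\" x", 0));
        try {
            scanBlockComment("/* never closed", 0);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Reporting the start offset matters because an unclosed comment otherwise surfaces as an error at the end of the file, far from the actual mistake.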

Page 18: Using JavaCC

JavaCC as a Parsing Tool

Page 19: Using JavaCC

JavaCC Overview

• Generates a top-down parser.
  Could be used for generating a Prolog parser, whose grammar is LL.
• Generates a parser in Java.
  Hence it can be integrated with any Java-based Prolog compiler/interpreter, to continue our example.
• Token specification and grammar specification structures are in the same file => easier to debug.

Page 20: Using JavaCC

Types of Productions in JavaCC

There can be four different kinds of productions.
• Javacode
  For something that is not context free or is difficult to write a grammar for, e.g. recognizing matching braces and error processing.
• Regular Expressions
  Used to describe the tokens (terminals) of the grammar.
• BNF
  Standard way of specifying the productions of the grammar.
• Token Manager Declarations
  The declarations and statements are written into the generated Token Manager (lexer) and are accessible from within lexical actions.

Page 21: Using JavaCC

JavaCC Look-ahead mechanism

• Exploration of tokens further ahead in the input stream.
• Backtracking is unacceptable due to performance hit.
• By default JavaCC has 1 token look-ahead. Could specify any number for look-ahead.
• Two types of look-ahead mechanisms
– Syntactic: A particular token is looked ahead in the input stream.
– Semantic: Any arbitrary Boolean expression can be specified as a look-ahead parameter.

e.g. A -> aBc and B -> b ( c )? Valid strings: "abc" and "abcc"
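Why one token of look-ahead is not enough for this grammar can be seen in a hand-written recursive-descent parser. This is a sketch of the idea only, not JavaCC output and not JavaCC syntax: the class `LookaheadDemo`, the `k` parameter, and the use of single characters as tokens are assumptions for the example.

```java
class LookaheadDemo {
    private static final char END = '\0';   // sentinel for "no more tokens"
    private final String in;
    private final int k;                     // number of look-ahead tokens (1 or 2)
    private int pos = 0;

    LookaheadDemo(String in, int k) {
        this.in = in;
        this.k = k;
    }

    /** Returns the n-th upcoming token (1-based) without consuming it. */
    private char peek(int n) {
        int i = pos + n - 1;
        return i < in.length() ? in.charAt(i) : END;
    }

    private void expect(char c) {
        if (peek(1) != c) {
            throw new IllegalStateException("expected '" + c + "' at offset " + pos);
        }
        pos++;
    }

    /** A -> a B c */
    private void parseA() {
        expect('a');
        parseB();
        expect('c');
    }

    /** B -> b ( c )? */
    private void parseB() {
        expect('b');
        boolean takeOptional = (k == 1)
                ? peek(1) == 'c'                       // greedy: any c is taken
                : peek(1) == 'c' && peek(2) == 'c';    // take c only if another c follows
        if (takeOptional) {
            expect('c');
        }
    }

    static boolean accepts(String input, int k) {
        try {
            LookaheadDemo p = new LookaheadDemo(input, k);
            p.parseA();
            return p.pos == input.length();
        } catch (IllegalStateException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        for (String s : new String[] {"abc", "abcc"}) {
            System.out.println(s + ": k=1 " + accepts(s, 1) + ", k=2 " + accepts(s, 2));
        }
    }
}
```

With one token of look-ahead the optional c in B swallows the c that A needs, so "abc" is wrongly rejected; checking two tokens resolves the choice without backtracking.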
