lexical analyzers and parsers

39
Heshan T. Suriyaarachchi

Upload: heshan-suriyaarachchi

Post on 01-Dec-2014

18.411 views

Category:

Technology


2 download

DESCRIPTION

 

TRANSCRIPT

Page 1: Lexical Analyzers and Parsers

Heshan T. Suriyaarachchi

Page 2: Lexical Analyzers and Parsers
Page 3: Lexical Analyzers and Parsers

• lexical analysis is the process of converting a sequence of characters into a sequence of tokens.

• Programs performing lexical analysis are called lexical analyzers or lexers.

• A lexer consists of a scanner and a tokenizer.

Page 4: Lexical Analyzers and Parsers
Page 5: Lexical Analyzers and Parsers

• A lexical analyzer breaks an input stream of characters into tokens.

• Writing lexical analyzers by hand can be a tedious process, so software tools have been developed to ease this task.

• Perhaps the best known such utility is Lex. Lex is a lexical analyzer generator for the UNIX operating system, targeted to the C programming language

Page 6: Lexical Analyzers and Parsers

• Lex takes a specially-formatted specification file containing the details of a lexical analyzer. This tool then creates a C source file for the associated table-driven lexer.

• The JLex utility is based upon the Lex lexical analyzer generator model.

• JLex takes a specification file similar to that accepted by Lex, then creates a Java source file for the corresponding lexical analyzer.

Page 7: Lexical Analyzers and Parsers

• A JLex input file is organized into three sections, separated by double-percent directives (``%%'').

• A proper JLex specification has the following format.

Page 8: Lexical Analyzers and Parsers

user code%%JLex directives%%regular expression rules

The ``%%'' directives distinguish sections of the input file and must be placed at the beginning of their line.

Page 9: Lexical Analyzers and Parsers

• The user code section - the first section of the specification file - is copied directly into the resulting output file. This area of the specification provides space for the implementation of utility classes or return types.

• The JLex directives section is the second part of the input file. Here, macros definitions are given and state names are declared.

• The third section contains the rules of lexical analysis, each of which consists of three parts: an optional state list, a regular expression, and an action.

Page 10: Lexical Analyzers and Parsers

• This code is copied verbatim into the lexical analyzer source file that JLex outputs, at the top of the file.

• Therefore, if the lexer source file needs to begin with a package declaration or with the importation of an external class, the user code section should begin with the corresponding declaration.

• This declaration will then be copied onto the top of the generated source file.

Page 11: Lexical Analyzers and Parsers

• The JLex directive section begins after the first ``%%'' and continues until the second ``%%'' delimiter. Each JLex directive should be contained on a single line and should begin that line.

Page 12: Lexical Analyzers and Parsers

• The third part of the JLex specification consists of a series of rules for breaking the input stream into tokens.

• These rules specify regular expressions, then associate these expressions with actions consisting of Java source code.

• The rules have three distinct parts: the optional state list, the regular expression, and the associated action. This format is represented as follows.

[<states>] <expression> { <action> }

Page 13: Lexical Analyzers and Parsers

• If more than one rule matches strings from its input, the generated lexer resolves conflicts between rules by greedily choosing the rule that matches the longest string.

• Rules appearing earlier in the specification are given a higher priority by the generated lexer.

• If the generated lexical analyzer receives input that does not match any of its rules, an error will be raised.

Page 14: Lexical Analyzers and Parsers

Therefore, all input should be matched by at least one rule. This can be guaranteed by placing the following rule at the bottom of a JLex specification:

. { java.lang.System.out.println("Unmatched input: " + yytext()); }

The dot (.) , will match any input except for the newline

Page 15: Lexical Analyzers and Parsers

• JLex will take a properly-formed specification and transform it into a Java source file for the corresponding lexical analyzer.

• A benchmark experiment was conducted, comparing the performance of a lexical analyzer generated by JLex to that of a hand-written lexical analyzer.

Page 16: Lexical Analyzers and Parsers

• The comparison was made for lexical analyzers of a simple ''toy'' programming language. The hand-written lexical analyzer was written in Java.

• The experiment consists of running each lexical analyzer on two source files written in the toy language, then measuring the time required to process these files.

Page 17: Lexical Analyzers and Parsers

The generated lexical analyzer proved to be quite quick, as the following results show.

Source File JLex-Generated Hand-Written Lexical Analyzer Lexical Analyzer

177 lines 0.42 seconds 0.53 seconds897 lines 0.98 seconds 1.28 seconds

The JLex lexical analyzer soundly outperformed the hand-written lexer.

Page 18: Lexical Analyzers and Parsers

One of the biggest complaints about table-driven lexical analyzers generated by programs like JLex is that these lexical analyzers do not perform as well as hand-written ones.

Therefore, this experiment is particularly important in demonstrating the relative speed of JLex lexical analyzers.

Page 19: Lexical Analyzers and Parsers

The following is a (possibly incomplete) list of unimplemented features of JLex.

1)The regular expression lookahead operator is unimplemented, and not included in the list of special regular expression metacharacters.

2)The start-of-line operator (^) assumes the following nonstandard behavior. A match on a regular expression that uses this operator will cause the newline that precedes the match to be discarded.

Page 20: Lexical Analyzers and Parsers

• Javac Main.java

• Java JLex.Main Sample.lex

• Javac sample.lex.java

• Java Sample

Page 21: Lexical Analyzers and Parsers

• Java CUP is a parser generator for Java

• Java CUP compatibility is turned off by default, but can be activated with the following JLex directive.

%cup

Page 22: Lexical Analyzers and Parsers

When given, this directive makes the generated scanner conform to the java_cup.runtime.Scanner interface. It has the same effect as the following three directives:

%implements java_cup.runtime.Scanner%function next_token%type java_cup.runtime.Symbol

Page 23: Lexical Analyzers and Parsers
Page 24: Lexical Analyzers and Parsers
Page 25: Lexical Analyzers and Parsers
Page 26: Lexical Analyzers and Parsers

Parsing (syntactic analysis) is the process of analyzing a sequence of tokens to determine their grammatical structure with respect to a given (more or less) formal grammar.

Page 27: Lexical Analyzers and Parsers

Document Object Model Platform- and language-independent

standard object model for representing HTML or XML and related formats.

Tree Structure based API:     The Dom parser implements the dom api and it creates a DOM tree in memory for a XML document

Page 28: Lexical Analyzers and Parsers

supports navigation in any direction (e.g., parent and previous sibling) and allows for arbitrary modifications

an implementation must at least buffer the document that has been read so far (or some parsed form of it).

best suited for applications where the document must be accessed repeatedly or out of sequence order

Page 29: Lexical Analyzers and Parsers

DOM parsers must have the entire tree in memory before any processing can begin, so the amount of memory used by a DOM parser depends entirely on the size of the input data.

Page 30: Lexical Analyzers and Parsers

When to use DOM parser Manipulate the document Traverse the document back and forth Small XML files

Drawbacks of DOM parser Consumes a lot of memory

Page 31: Lexical Analyzers and Parsers

Is a serial access parser API for XML. Provides a mechanism for reading data

from an XML document. Popular alternative to the DOM. The quantity of memory that a SAX

parser must use in order to function is typically much smaller than that of a DOM parser.

Page 32: Lexical Analyzers and Parsers

Because of the event-driven nature of SAX, processing documents can often be faster than DOM-style parsers

Page 33: Lexical Analyzers and Parsers

When to use SAX parser No structural modification Huge XML files

Drawbacks of SAX Parser Certain kinds of XML validation require

access to the document in full

Page 34: Lexical Analyzers and Parsers

OM stands for Object Model (also known as AXIOM - AXis Object Model)

Refers to the XML infoset model that is initially developed for Apache Axis2.

For an object oriented language the obvious choice is a model made up of objects. DOM and JDOM are two such XML models.

Page 35: Lexical Analyzers and Parsers

OM is conceptually similar to such an XML model by its external behavior but deep down it is very much different.

OM is based on Pull Parsing instead of Push Parsing.

Page 36: Lexical Analyzers and Parsers

Pull parsing is a recent trend in XML processing.

The previously popular XML processing frameworks such as SAX and DOM were "push-based“, which means the control of the parsing was in the hands of the parser itself.

Page 37: Lexical Analyzers and Parsers

Push-based approach is fine and easy to use, but it was not efficient in handling large XML documents since a complete memory model will be generated in the memory.

Pull parsing inverts the control and hence the parser only proceeds at the users command.

The user can decide to store or discard events generated from the parser.

Page 38: Lexical Analyzers and Parsers

Credits goes out to• Mr Elliot Joel Berk who wrote JLex.• To the Department of Computer Science,

Princeton University for maintaining JLex.• All the others who contributed towards these

projects.

A special thanks goes out to • Dr. Damith Karunaratne for giving me this

opportunity.

Page 39: Lexical Analyzers and Parsers