attribute grammars and parsers for chunk-based binary...
TRANSCRIPT
i
Attribute Grammars and Parsers
for Chunk-Based Binary File Formats
William Underwood
Sandra Laib
ICL/ITDSD Technical Report 12-01
December 2012
Information Technology and Decision Support Division
Information and Communications Laboratory
Georgia Tech Research Institute
Atlanta, Georgia
This research was sponsored by the Army Research Laboratory (ARL) and the Applied Research
Division of the National Archives and Records Administration (NARA) and was accomplished
under Cooperative Agreement Number W911NF-10-2-0030. The views and conclusions
contained in this document are those of the authors and should not be interpreted as representing
the official policies, either expressed or implied, of the Army Research Laboratory, NARA, or
the U.S. Government. The U.S. government is authorized to reproduce and distribute reprints for
Government purposes notwithstanding any copyright notation heron.
iii
ABSTRACT
The capability to validate and view or play binary file formats, as well as to convert binary file
formats to standard or current file formats, is critically important to the preservation of digital
data and records. This report describes the extension of context-free grammars from strings to
binary arrays. Binary files are arrays of data types such as long and short integers, floating-point
numbers, and pointers as well as characters. The concept of an attribute grammar is extended to
these context-free binary array grammars. Binary array attribute grammars have been used to
define a number of chunk-based file formats. These grammars consist of grammar rules for
specifying the structure or layout of a file format and lexical rules for defining the binary data
types of the file format. A parser generator has been used with some of these grammars to
generate parsers for binary file formats. The capability has been developed to display the parse
trees of binary array attribute grammars as pretty printed nested lists.
The significance of success in this research task is that if binary file formats can be specified
with binary array attribute grammars, then only one parsing generator is needed to generate the
parsers for verifying that a file conforms to a particular format. Similarly, the same parsing
generator can possibly be used for conversion of legacy file formats to current or standard
formats. Finally, the same parser generator can possibly be used for generating viewers/players
for most file formats. This would increase the likelihood of preserving and making available into
the indefinite future those digital records encoded in binary file formats.
iv
TABLE OF CONTENTS
1 INTRODUCTION ............................................................................................................................................. 1
1.1 BACKGROUND .......................................................................................................................................................... 1
1.2 PURPOSE ................................................................................................................................................................ 1
1.3 SCOPE .................................................................................................................................................................... 1
2 SPECIFICATION OF BINARY FILE FORMATS.......................................................................................... 1
3 CONTEXT-FREE GRAMMARS AND ATTRIBUTE GRAMMARS ............................................................. 3
3.1 CONTEXT-FREE GRAMMARS FOR ARRAYS OF DATA TYPES ................................................................................................ 4
3.2 ATTRIBUTE GRAMMARS ............................................................................................................................................. 4
4. GRAMMARS FOR CHUNK-BASED BINARY FILE FORMATS ................................................................ 5
4.1 INTERLEAVED BITMAP (ILBM) ................................................................................................................................... 6
4.2 RIFF WAVEFORM PCM AUDIO FILE FORMAT (WAVE) ................................................................................................... 9
5 RECURSIVE DESCENT PARSERS FOR BINARY FILE GRAMMARS ................................................... 10
6. A PARSER GENERATOR FOR BINARY FILE FORMATS ...................................................................... 12
6.1 ANTLR ................................................................................................................................................................ 12
6.2 USING ANTLR TO GENERATE A PARSER FOR A BINARY FILE FORMAT ............................................................................... 12
6.2.1 An ANTLR Grammar for an ILBM Chunk-based File Format ....................................................................... 13
6.2.2 The Lexer and Parser .................................................................................................................................. 17
6.3 IMPROVEMENTS TO THE LEXICAL ANALYSIS OF DATA TYPES ............................................................................................. 18
6.4 GENERATING PARSE TREES FOR BINARY ARRAY GRAMMARS ........................................................................................... 33
6.4.1 How ANTLR4 Generates Parse Trees for String Attribute Grammars ........................................................ 33
6.4.2 Using ANTLR4 to Generate Parse Trees for Binary Attribute Grammars ................................................... 34
6.4.3 Extensions to ANTLR4 to Generate Pretty Printed Parse Trees with Data Type Values ............................. 35
6.4.4 How ANTLR4’s Attribute Grammars, Semantic Predicates and Action Clauses Help with Parsing and
Validation ............................................................................................................................................................ 38
7 RELATED RESEARCH ................................................................................................................................. 40
8. CONCLUSION ............................................................................................................................................... 41
REFERENCES ................................................................................................................................................... 43
APPENDIX A: BINARY ARRAY GRAMMARS FOR CHUNK-BASED BINARY FILE FORMATS .......... 53
A.1 ELECTRONIC ARTS INTERCHANGE FILE FORMAT (IFF) .................................................................................................... 53
A.1.1 Interleaved Bitmap (ILBM) ......................................................................................................................... 53
A.1.2 8-bit Sampled Voice (8SVX) ....................................................................................................................... 53
A.2 STANDARD MIDI FILE FORMAT ................................................................................................................................. 54
A.3 AUDIO INTERCHANGE FILE FORMAT ........................................................................................................................... 54
A.4 RESOURCE INTERCHANGE FILE FORMATS .................................................................................................................... 56
A.4.1 RIFF-Waveform (WAVE) ............................................................................................................................. 56
A.4.2 WAVEFORMATEX ....................................................................................................................................... 56
A.5 JPEG ................................................................................................................................................................... 56
v
A.5.1 JPEG File Interchange Format (JFIF) .......................................................................................................... 56
A.5.2 JPEG/Exif ................................................................................................................................................... 57
A.5.3 SPIFF (Still Picture Interchange File Format) .............................................................................................. 58
A.6 ADVANCED SYSTEMS FORMAT .................................................................................................................................. 58
A.6.1 Windows Media Audio 7-9......................................................................................................................... 60
A.6.2 Windows Media Video 7-9.1 ...................................................................................................................... 61
A.7 PORTABLE NETWORK GRAPHICS ................................................................................................................................ 62
A.8 APPLE QUICKTIME MOVIE ....................................................................................................................................... 62
A.9 ESRI SHAPEFILE ..................................................................................................................................................... 64
APPENDIX B: OTHER BINARY CHUNK-BASED FILE FORMATS ................................................................................ 68
APPENDIX C: UTILITIES TO PRETTY PRINT A NESTED LIST STYLE PARSE TREE ........................................................ 71
vi
List of Figures
Figure 1. File format (Layout) of a RIFF container for a Webp Image........................................................... 2
Figure 2. ILBM File ham.pic Displayed .......................................................................................................... 6
Figure 3. An Attribute Grammar for an ILBM Chunk-based File Format ...................................................... 7
Figure 4. A parse tree for the ILBM file ham.pic ........................................................................................... 8
Figure 5. Binary Array Grammar for RIFF WAVEFORM PCM File Format ................................................... 10
Figure 6. Top-Level Grammar Rule ............................................................................................................. 10
Figure 7.Top-level Java program of a recursive descent parser ................................................................. 11
Figure 8. An ANTLR Grammar for the ILBM File Format. ............................................................................ 16
Figure 9. A Java Application for Parsing an ILBM File ................................................................................. 17
Figure 10. Output of the ILBM Parser Applied to the File BLK.IFF .............................................................. 18
Figure 11. Lexer for all Context-free Binary Array Grammars .................................................................... 19
Figure 12. Data Translation and Other Gammar Utilities ........................................................................... 21
Figure 13. Datatype rules used by all binary array attribute grammars ..................................................... 23
Figure 14. The ILBM File Format Grammar ................................................................................................. 26
Figure 15. The Wave_Form File Format Grammar ..................................................................................... 32
Figure 16. Parse tree shown as a nested list. .............................................................................................. 33
Figure 17. Parse tree shown as a graph. ..................................................................................................... 33
Figure 18. ILBM File Docking.bru ................................................................................................................ 34
Figure 19. Parse Tree Created by ANTLR4 .................................................................................................. 35
Figure 20. Pretty Printed Parse Tree of ILBM Formated File Docking.bru .................................................. 38
1
1 Introduction
1.1 Background
Automated tools are required for identifying and validating the formats of the huge number of
files ingested into digital data and record archives. Automated tools are also needed for
viewing/playing text and binary file formats, and for converting legacy and obsolete file formats
to standard or current formats. Technologies such as data description languages have emerged to
address these preservation challenges [Dunckley et al 2007].
Context-free grammars have been used to specify the syntax of programming languages.
Attribute grammars provide a framework for formally specifying the semantics of a language
based on its context-free grammar and for addressing the mildly context-sensitive features of
programming languages such as agreement of the data types of variables in expressions with data
type declarations. Such grammars have parsing/translation algorithms that can be used for
syntax-checking and interpretation or translation of the languages the grammars define.
1.2 Purpose
The research question addressed by this research is whether it is possible to extend the context-
free grammars used to specify the syntax of programming languages to the specification of
binary file formats and to use these grammars with parsers for validating the file formats of
binary files.
1.3 Scope
The next section discusses the traditional approach to specifying file formats and two of the
major families of binary file formats. In section 3, extensions to the concepts of context-free
grammars and attribute grammars are described that enable the specification of binary file
formats. In section 4, examples of binary array attribute grammars for a chunk-based binary file
formats are presented. In section 5, recursive descent parsers for these kinds of grammars are
described. In section 6, experience in using ANTLR, a parser generator for LL(k) string
grammars, in generating parsers for recognizing the formats of binary files is discussed. This
includes the extension of ANTLR to work with grammars with data types other than strings and
the creation of parse trees represented as pretty printed nested lists. In section 7, related research
is described. Section 8 summarizes the results.
2 Specification of Binary File Formats
Traditionally, a file (or record) layout was a form showing how fields were positioned in a file
(or record). The fields are named and have as attributes a data type, length and sometimes a
constant value or range of values. Fields are offset at addresses relative to the beginning of a file.
2
File formats are still specified in this manner. For instance Figure 1 shows the format layout of a
RIFF container for the VP8 codec of a Webp image [Google 2010].
0ffset Description
0 "RIFF" 4-byte tag
4 long integer: size of data chunk starting at offset 8
8 "WEBP" the form-type signature
12 "VP8 " 4-bytes tag, describing the raw video format used
16 long integer: size of the raw VP8 image data chunk, starting at offset 20
20 the VP8 image data
Figure 1. File format (Layout) of a RIFF container for a Webp Image.
Many binary file formats are specified using pseudo-regular expressions or pseudo-EBNF
notation. EBNF (Extended Backus-Naur Form) is a notation for expressing context-free
grammars in a compact, human readable way. Pseudo-EBNF is similar in concept to pseudocode,
which is a high-level description of a computer algorithm that is intended for human
understanding, but that omits details that would be necessary for computer execution. The
pseudo-EBNF (or regular expression) specification of a file format is augmented with a natural
language description of the details that are not actually expressible in the EBNF (or regular
expression) notation. These details are often the context-sensitive relationships of the declared
size of an array to its actual length, or the relationship of an address pointer to the actual location
of the data pointed to in the file. These context-sensitive relationships are not expressible in a
context-free grammar (or EBNF notation). It is a goal of this research to extend the EBNF
notation and concept of a context-free grammar to include the details that are necessary to
precisely specify binary file formats that include these context-sensitive features.
Binary file formats with a similar file structure are referred to as a family of file formats. There
are two families of binary file formats that are readily distinguished, the chunk-based and the
directory-based file formats.
Chunk-based file formats were created by Electronic Arts and Commodore-Amiga as the
Interchange File Format (IFF) [Morrison 1985]. A chunk consists of an ID tag, a size, and size
bytes of data. The data may include chunks called subchunks. Subchunks have the same structure
as chunks. Subchunks can have subchunks. Chunk-based file formats are primarily used as
containers for multimedia, but have also been used for word processing files including pictures
and other figures. File format specifications for chunk-based formats may use other terms than
3
chunks in describing the format, for instance, "atoms" in QuickTime/MP4, "segments" in JPEG,
“boxes” in JPEG 2000, and “tagged data representations” in AutoCAD DXF.
The Audio Interchange File Format (AIFF) developed by Apple computer in 1988 was based on
Electronic Arts Interchange File Format. Apple Core Audio Format (CAF), Apple’s replacement
format for AIFF, remains a chunk-based format [Apple Computer 2005]. The Resource
Interchange File Format (RIFF) introduced by IBM and Microsoft [1991] is also based on
Electronic Arts’ Interchange File Format. The Microsoft container formats like Audio-Video
Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis. WebP, a picture format
recently introduced by Google[2010], also uses RIFF as a container.
Microsoft’s Advanced Systems Format (ASF) [Microsoft 2004] is also a chunk-based container
format. The most common file formats contained within an ASF file are Windows Media Audio
(WMA) and Windows Media Video (WMV). Microsoft’s Binary Interchange File Format
(Microsoft Excel’s File Format) is chunk-based [Rentz 2008].
Another family of binary file formats is directory-based. A directory-based file format consists of
one or more directory tables which contain one or more directory entries. A directory entry
specifies where the actual data is located. This is the scheme used in TIFF files, OLE (Microsoft
Object Linking and Embedding) files, OASIS OpenDocument and Microsoft Open Office files.
3 Context-Free Grammars and Attribute Grammars
Context-free grammars have been widely used to define programming languages. A context-free
grammar G is a quadruple <N, Σ, S, P> where:
N is a finite set of non-terminal symbols,
Σ is a finite set of terminal symbols,
S ∈ N is the start symbol,
P is a set of production rules of the form A → {N ∪ Σ}* where A ∈ N
The set of all strings over a vocabulary Σ is denoted by Σ*. Formally, a string is said to directly
derive in G a string ', denoted G ', if ' can be obtained from by replacing a substring with
, where is a production rule in G. A string is said to derive ' in G, denoted G* ', if
0 G . . . G 'n for some 0, . . . , n such that 0 = and n = '.
The language generated by a context-free grammar G is the set L(G) = {w: w ∈ Σ* and S G*
w}.
4
3.1 Context-Free Grammars for Arrays of Data Types
An array is a data structure consisting of a collection of elements (values or variables), each
identified by an index. An array is stored so that the position of each element can be computed
from its index. For example, an array of 10 integer variables, with indices 0 through 9, may be
stored as 10 (4 byte) words at file addresses 0, 4, 8, …36, so that the element with index i has the
address 4 × i.
Arrays are used to implement many other data structures, such as lists and strings. In most
modern computers, the internal memory and external memory of storage devices (and files on
those devices) is a one-dimensional array of data types, whose indices are their addresses.
A data type is a classification of one of various types of data, such as floating-point, integer,
character, pointer, or Boolean, that determines the possible values for that type, the operations
that can be performed on values of that type, and the way values of that type can be stored. Data
types have names such as int16, int32, float, char, bool, ptr. We define a binary file to be an array
of values of data of various types.
We define a context-free binary array grammar BAG as a quintuple <N, D, Σ, S, P> where:
N is a finite set of non-terminal symbols,
D is a set of data types,
Σ is a finite set of binary values of data types D called terminals,
S ∈ N is the start symbol,
P is a set of production rules of the form A → N* and A → t where A ∈ N and t ∈ Σ. The
first set of productions are called structural rules and the second are called lexical rules.
The set of all binary arrays is BinaryArrays = [Indices→ Σ], where Indices = {1, . . . , ∞}.
Formally, a string is said to directly derive in BAG a string ', denoted BAG ', if ' can be
obtained from by replacing a substring with , where is a production rule in G. A string
is said to derive ' in BAG, denoted BAG* ', if 0 G . . . G 'n for some 0, . . . , n such that
0 = and n = '.
The language generated by a context-free binary array grammar BAG is the set L(BAG) = {w:
w ∈ BinaryArrays and S BAG* w}.
3.2 Attribute Grammars
Context-free grammars cannot represent context-sensitive aspects of programming languages
such as: (1) enforcing the constraint that all variables are declared before they are used, or (2)
5
checking the number of parameters in a function call against the number in the function’s
declaration. Context-free grammars have also been used to define the syntax of English and other
natural languages, but they cannot define such context-sensitivity as subject verb agreement in
English sentences. A number of extensions of context-free grammars have been proposed to
address the context-sensitivity of programming languages and natural languages. These include
indexed [Aho 1968], recording [Barth 1979], affix [Koster 1971], VW [van Wijngaarden 1974]
and attribute grammars [Knuth 1969, 1971].
Context-free grammars also cannot represent the semantics of programming languages. Knuth
[1968, 1971] proposed an extension of context-free grammars termed attribute grammars that
addresses the semantics as well as the context-sensitivity of programming languages.
An attribute grammar AG is a triple <G, A, AR>, where G is a context-free grammar for the
language, A associates each grammar symbol X ∈ (N ∪ Σ) with a set of attributes, and AR
associates each production R ∈ P with a set of attribute computation and/or condition rules.
A(X), where X ∈ (N ∪ Σ), can be further partitioned into two sets: synthesized attributes S(X)
and inherited attributes I(X). AR(R), where R ∈ P, contains rules for computing inherited and
synthesized attributes associated with the symbols in the production R. Synthesized attributes are
evaluated and passed up from deeper terminals and non-terminals while inherited attributes are
passed down from upper non-terminals.
An L-attributed grammar is an attribute grammar that uses:
i) synthesized attributes, and
ii) inherited attributes where the inherited attribute depends only on:
a) attributes of the other child nodes to its left
b) inherited attributes of P itself.
L-attributed grammars allow the attributes to be evaluated in one left-to-right traversal of the
abstract syntax tree. As a result, attribute evaluation in L-attributed grammars can be
incorporated conveniently into top-down parsing.
4. Grammars for Chunk-based Binary File Formats
The attribute grammars that we present are a combination of synthesized semantic rules (attribute
values of the left side of a production are calculated from attribute values of symbols on the right
side) and inherited rules (attribute values of symbols on the right side of a production are
calculated from attribute values of the symbol on the left side).
6
4.1 InterLeaved BitMap (ILBM)
The author of the Interleaved Bitmap (ILBM) file format recognized that the structure of the file
format could be specified using pseudo EBNF rules and C data structure declarations [Morrison
1986]. In this section, the “grammar” of the ILBM file format specification is represented as a
context-free binary array grammar with synthesized and inherited attributes.
Figure 2 shows a display of an ILBM file ham.pic. Figure 3 shows an attribute grammar for its
chunk-based binary file format. The attribute grammar is a combination of synthesized semantic
rules (attribute values of the left side of a production are calculated from attribute values of
symbols on the right side) and inherited rules (attribute values of symbols on the right side of a
production are calculated from attribute values of the symbol on the left side). The attribute
grammar is also an L-attributed grammar. Figure 4 shows a manually generated parse tree for
the file ham.pic.
Figure 2. ILBM File ham.pic Displayed
7
<ILBM> → “FORM” <cksize UINT32> “ILBM”
<propertyChunk> {propertyChunk.foundBMHD==true}?
<dataChunk><BODY>
<propertyChunk> → (<BMHD> | <CMAP> | <CAMG>)+
<BMHD> → “BMHD” <cksize UINT32>{cksize.value==20}?
<BitmapHeader>{foundBMHD=true}
<BitMapheader> → <width UINT32>
< height UINT16>
< xposition INT16>
<yposition INT16>
<nplanes BYTE>
<masking BYTE>
<compression BYTE>
<reserved BYTE>
<transparentcolor UINT16>
<xaspect BYTE>
<yaspect BYTE>
<pagewidth INT16>
<pageheight INT16>
<CMAP> -→“CMAP” <cksize UINT32> {cksize.value mod 3==0}? {n=cksize.val/3}
<color>[n]
<color> → <red BYTE> <green BYTE> <blue BYTE>
<CAMG> → “CAMG” <cksize UINT32> {cksize.value==4}?
<viewmode type=INT32>
<dataChunk> → <CRNG><CCRT>
<CRNG> → “CRNG” <cksize UINT32> <CRange>
<CCRT> → “CCRT” <cksize UINT32> <cycleinfo>
<CRange> → <pad1, WORD> {pad1.value==0}?
<rate, WORD> <active, WORD> <low, UBTYE> <high, UBYTE>
<cycleinfo> → <direction, WORD><start, UBYTE><end, UBYTE>
<seconds, INT32><microseconds, INT32><pad, WORD>{pad.value==0}?
<BODY> → “BODY” <cksize UINT32> <data BYTE>[cksize.value]
Figure 3. An Attribute Grammar for an ILBM Chunk-based File Format
9
4.2 RIFF Waveform PCM Audio File Format (WAVE)
The waveform PCM File format was originally specified in an extended BNF notation [IBM &
Microsoft 1991]. The chunk size was omitted from the EBNF rules. Figure 5 shows the EBNF rules
with the cksize inserted. To convert this grammar into a context-free binary array grammar, data
types are added to non-terminals which would be replaced by a binary value were an additional
lexical production rule applied.
<WAVE-form> -> “RIFF” <cksize UINT32> “WAVE”
<fmt-ck>
[<fact-ck>]
[<cue-ck>]
[<playlist-ck>]
[<assoc-data-list>]
<wave-data>
<fmt-ck> -> “fmt “ <cksize UINT32> <common-fields> <wBitsPerSample UINT16>
<common-fields> -> <wFormatTag UINT16 value=0x0001>
<wChannels UINT16>
<dwSamplesPerSec UINT32>
<dwAvgBytesPerSec UINT32>
<wBlockAlign UINT16>
<wave-data> -> {<data-ck> | <wave-list>}
<data-ck> -> data <cksize UINT32> <wave-data>
<wave-list> -> LIST <cksize UINT32> “wavl” {<data-ck> | <silence-ck>}
<silence-ck> -> slnt <cksize UINT32> <dwSamples UINT32>
<fact-ck> -> factxksize <dwFileSize UINT32>
<cue-ck> -> cue <cksize UINT32> <dwCuePoints UINT32> <cue-point>+
<cue-point> -> <dwName UINT32>
<dwPosition UINT32>
<fccChunk CHAR[4]>
<dwChunkStart UINT32>
<dwBlockStart UINT32>
<dwSampleOffset UINT32>
<playlist-ck> -> “plst” <cksize UINT32> <dwSegments UINT32> <play-segment>+
<play-segment> -> <dwName UINT32>
<dwLength UINT32>
<dwLoops UINT32
<assoc-data-list>-> “LIST” <cksize UINT32> “adtl” <labl-ck><note-ck><ltxt-
ck><file-ck>
10
<labl-ck> -> “labl” <cksize UINT32> <dwName UINT32> <data:ZSTR>
<note-ck> -> “note” <cksize UINT32> <dwName UINT32> <data:ZSTR>
<ltxt-ck> -> “ltxt” <cksize UINT32> <dwName UINT32>
<dwSampleLength UINT32>
<dwPurpose UINT32>
<wCountry UINT16>
<wLanguage UINT16>
<wDialect UINT16>
<wCodePage UINT16>
<data:BYTE>+
<file-ck> -> “file” <cksize UINT32> <dwName UINT32> <dwMedType UINT32>
<fileData:BYTE>+
Figure 5. Binary Array Grammar for RIFF WAVEFORM PCM File Format
5 Recursive Descent Parsers for Binary File Grammars
A recursive descent parser is a top-down parser built from a set of recursive procedures where
each procedure implements one of the production rules of the grammar. A predictive parser is a
recursive descent parser that does not require backtracking. Predictive parsing is possible only
for the class of LL(k) grammars, which are the context-free grammars for which there exists
some positive integer k that allows a recursive descent parser to decide which production to use
by examining only the next k tokens of input.
Figure 6 shows the top-level grammar rule of the binary array attribute grammar for the ILBM
file format. The interpretation of this rule is: The first 4 bytes of the file format contain the
characters “FORM”. Next is a 4-byte (32-bit) unsigned integer indicating a chunk size that is the
size of the remainder of the file. The next 4 bytes contain the characters “ILBM”. This must be
followed by a propertyChunk, a dataChunk and a BODY chunk.
Figure 6. Top-Level Grammar Rule
11
Figure 7 shows the top-level Java program named ilbm in a top-down recursive descent parser
for the ILBM grammar. It is manually constructed from the top-level grammar rule. There are
also Java programs corresponding to <propertyChunk>, <dataChunk> and <BODY>.
The property chunks of an ILBM file must contain a BitmapHeader chunk. The attribute
foundBMHD() is passed back to the program with a value true or false, indicating whether a
BitmapHeader (BMHD) was found.
The recursive descent parsers for the binary array grammars defined in this paper do not require
a separate lexical analyzer (lexer) to identify the data of a file prior to parsing. The data types
are predicted by the grammar rules. The top-down parser defines which data types are expected
next so that the tokens (data type values) are identified while parsing.
Figure 7.Top-level Java program of a recursive descent parser
The semantic rules of an L-attributed grammar can be included in the recursive descent parser to
check for context-sensitive aspects of the grammar such as chunk size or image dimensions and
to use these values in parsing the chunk data or images. The pointers to file addresses that occur
in the file formats might be handled by pushing the current address in the file onto a pushdown
12
stack. When the parser exhausts the procedures implementing production rules corresponding to
the pointed to location, the topmost address on the pushdown stack is popped and the procedure
corresponding to the nonterminal at that position is executed.
Recursive descent parsers with a pushdown stack can be used with binary array grammars to
create file format recognizers. A recognizer (syntax checker or validator) is a parser that reads a
file and generates an error if the file does not conform to the syntax specified by the grammar.
6. A Parser Generator for Binary File Formats
What we want is a parser generator whose input is a binary array attribute grammar for a
particular binary file format, and whose generated output is the Java source code of a recursive
descent parser for the class of binary file formats specified by the grammar. Such a parser
generator does not yet exist, but there is a widely used parser generator for attribute grammars
for string-based languages.
6.1 ANTLR
ANTLR (ANother Tool for Language Recognition) is a parser generator that uses LL(k) parsing
[Parr & Quong 1995; Parr 2007]. ANTLR takes as input an attribute grammar that specifies a
language and generates as output source code for a parser for that language. ANTLR supports
generating code in a number of the programming languages including C, Java, JavaScript and
Python.
ANTLR also generates a Lexical analyzer (lexer) from lexical rules in the grammar. A lexer is a
program that converts a sequence of characters into tokens. A lexer is not needed to parse binary
file formats, because the data types predicted by the parser are the tokens of the grammar.
6.2 Using ANTLR to Generate a Parser for a Binary File Format
It is possible to use ANTLR to test our binary array grammars in recognizing the chunk-based
and directory-based binary file formats. This is accomplished by writing a lexer that treats each
input byte in a file as a character token. Then functions are created for each data type in the
binary file grammar that convert the appropriate number of character tokens to binary data types,
e.g., int 16, int32, etc. This is effective, if someone clumsy, and has allowed us to test our
attribute grammars for binary file formats as well as to demonstrate the feasibility of creating a
parser generator for binary array grammars.
13
6.2.1 An ANTLR Grammar for an ILBM Chunk-based File Format
Figure 8 shows a specification for the ILBM file format that is input to ANTLR. The output of
ANTLR is a parser for ILBM files.
grammar iblm;
options {
language = Java; // The programming language of the output parser
TokenLabelType = CommonToken; }
// The text contained in @header is placed at the top of the parser file.
@header {
package gtri.org.grammars.chunk;
import java.io.UnsupportedEncodingException;}
// The text in @lexer::header is placed at the top of the lexer file.
@lexer::header {
package gtri.org.grammars.chunk;
import java.io.UnsupportedEncodingException;}
// The text in the @members section is placed inside the parser class.
@members {
// getUByteValue inputs the token of a Byte terminal and returns an
// integer value.
public int getUByteValue(Token tok) throws UnsupportedEncodingException
{
int intValue = 0;
char charValue = tok.getText().charAt(0);
intValue = (int) charValue;
intValue = intValue & 0x00ff;
return intValue;} }
// Grammar Rules used to create the Parser. Start symbol ilbm
iblm returns [boolean result]
:
FORM ckSize ILBM
// Print out ILBM chunksize.
{ System.out.println("ID: ILBM Size: " + $ckSize.value);}
// Semantic predicate that ensures that at least on BMHD is found.
propertyChunks {$propertyChunks.foundBMHD == true}?
dataChunks?
body?
{ $result = true; };
// The bmhd, cmap and camg property chunks can occur in any order.
// At least one BMHD data chunk must be present.
propertyChunks returns [boolean foundBMHD]
: (bmhd { $foundBMHD = true; }
| cmap
| camg)+;
bmhd
: BMHD ckSize
{
System.out.println("ID: BMHD Size: " + $ckSize.value);
}
bitMapHeader;
bitMapHeader
: width = uWordValue
14
height = uWordValue
xPosition = wordValue
yPosition = wordValue
nPanes = uByteValue
masking = uByteValue
compression = uByteValue
reserved1 = uByteValue
transparentColor = uWordValue
xAspect = uByteValue
yAspect = uByteValue
pageWidth = wordValue
pageHeight = wordValue
// Print out values in the bitmap structure.
{
System.out.println(" Width: " + $width.value);
System.out.println(" Height: " + $height.value);
System.out.println(" X Position: " + $xPosition.value);
System.out.println(" Y Position: " + $yPosition.value);
System.out.println(" Number of Panes: " + $nPanes.value);
System.out.println(" Masking:" + $masking.value);
System.out.println(" Compression:" + $compression.value);
System.out.println(" Reserved1:" + $reserved1.value);
System.out.println(" TransParent Color:" +
$transparentColor.value);
System.out.println(" X Aspest:" + $xAspect.value);
System.out.println(" Y Aspect:" + $yAspect.value);
System.out.println(" Page Width:" + $pageWidth.value);
System.out.println(" Page Height:" + $pageHeight.value);
};
cmap returns [long value]
// Initialize the index value i in the cmap function.
@init { int i = 0; }
:
CMAP ckSize
// Print out chunk size.
{ $value = $ckSize.value;
System.out.println("ID: CMAP Size: " + $ckSize.value);
}
(colorRegister
// Print out color values.
{ System.out.println(" Color[" + i++ + "]: {" +
$colorRegister.redValue +
", " + $colorRegister.greenValue +
", " + $colorRegister.blueValue +
"}");
})+;
colorRegister returns [int redValue, int greenValue, int blueValue]
: red = uByteValue green = uByteValue blue = uByteValue
{ $redValue = $red.value;
$greenValue = $green.value;
$blueValue = $blue.value;
};
camg
: CAMG ckSize longValue
{ System.out.println("ID: CAMG Size: " + $ckSize.value);
15
System.out.println(" View Modes: 0x" +
Long.toHexString($longValue.value));
};
// The crng and ccrt data chunks
dataChunks
: crng
| ccrt;
crng
: CRNG ckSize
{ System.out.println("ID: CRNG Size: " + $ckSize.value); }
cRange;
cRange
: pad1 = wordValue {$pad1.value == 0}?
rate = wordValue
active = wordValue
low = uByteValue
high = uByteValue ;
ccrt
: CCRT ckSize
{ System.out.println("ID: CCRT Size: " + $ckSize.value); }
cycleInfo;
cycleInfo
: direction = wordValue
start = uByteValue
end = uByteValue
seconds = longValue
microseconds = longValue
pad = wordValue {$pad.value == 0}? ;
body
: BODY ckSize
{ System.out.println("ID: BODY Size: " + $ckSize.value); }
{$ckSize.value > 0}? data
{ System.out.println(" Number of bytes: " + $data.value); }
{$data.value == $ckSize.value}?;
// defines the data chunk of the body chunk
data returns [long value]
: (b+=BYTE)+ {$b.size() > 0}?
{ $value = $b.size(); };
// The cksize is the value of a long integer.
ckSize returns [long value]
: longValue
{$value = $longValue.value;};
// defines the longValue data typein terms of BYTE terminals
longValue returns [long value] throws UnsupportedEncodingException
: h1 = BYTE h2 = BYTE l1= BYTE l2 = BYTE
{$value = (long)
( getUByteValue($h1) << 24 |
getUByteValue($h2) << 16 |
getUByteValue($l1) << 8 |
getUByteValue($l2));
};
catch[RecognitionException re]{
reportError(re);
recover(input,re); }
catch[UnsupportedEncodingException uee]{
16
System.out.println("Unsupported Encoding Exception"); }
// defines the unsigned word data type
uWordValue returns [int value]
: b1 = BYTE b2 = BYTE
{$value = (int)
( getUByteValue($b1) << 8 |
getUByteValue($b2));
} ;
catch[RecognitionException re]{
reportError(re);
recover(input,re);
}
catch[UnsupportedEncodingException uee]{
System.out.println("Unsupported Encoding Exception");
}
// defines the signed word data type
wordValue returns [int value]
: b1 = BYTE b2 = BYTE
{$value = (short)
( getUByteValue($b1) << 8 |
getUByteValue($b2));
} ;
catch[RecognitionException re]{
reportError(re);
recover(input,re);
}
catch[UnsupportedEncodingException uee]{
System.out.println("Unsupported Encoding Exception");
}
// defines the unsigned word data type
uByteValue returns [int value]
: BYTE
// calls the getUByteValue from a BYTE token
{$value = (int) getUByteValue($BYTE);
};
catch[RecognitionException re]{
reportError(re);
recover(input,re);
}
catch[UnsupportedEncodingException uee]{
System.out.println("Unsupported Encoding Exception");
}
//Terminal rules used to create the Lexer
FORM : 'FORM' ;
ILBM : 'ILBM' ;
BMHD : 'BMHD' ;
CMAP : 'CMAP' ;
CAMG : 'CAMG' ;
CRNG : 'CRNG' ;
CCRT : 'CCRT' ;
BODY : 'BODY' ;
BYTE : . ;
Figure 8. An ANTLR Grammar for the ILBM File Format.
17
6.2.2 The Lexer and Parser
When provided the described in the previous section, ANTLR generates the ilbmlexer and
ilbmparser Java programs. The Java application shown in Figure 9 uses these programs to parse
the ILBM file BLK.IFF (a black rectangle) and extract the metadata shown in Figure 10.
package org.gtri.grammars.chunk;
import java.io.IOException;
import org.antlr.runtime.ANTLRFileStream;
import org.antlr.runtime.ANTLRFileStream;
import org.antlr.runtime.ANTLRStringStream;
import org.antlr.runtime.CharStream;
import org.antlr.runtime.CommonTokenStream;
import org.antlr.runtime.RecognitionException;
import org.antlr.runtime.TokenStream;
import gtri.org.grammars.chunk.iblmLexer;
import gtri.org.grammars.chunk.iblmParser;
public class Test {
static String filePath = "C:\\Users\\sl170\\Pictures\\IFF-ilbm\\BLK.IFF";
public static void main(String[] args) throws RecognitionException, IOException {
ANTLRFileStream stream = new ANTLRFileStream(filePath, "ISO-8859-1");
iblmLexer lexer = new iblmLexer(stream);
TokenStream tokenStream = new CommonTokenStream(lexer);
iblmParser parser = new iblmParser(tokenStream);
System.out.println("FILE: " + filePath);
System.out.println("Successfully parsed = " + parser.iblm());
}}
Figure 9. A Java Application for Parsing an ILBM File
18
FILE: C:\Users\sl170\Pictures\IFF-ilbm\BLK.IFF
ID: ILBM Size: 2556
ID: BMHD Size: 20
Width: 200
Height: 144
X Position: 0
Y Position: 0
Number of Panes: 6
Masking:0
Compression:1
Reserved1:0
TransParent Color:0
X Aspest:3
Y Aspect:3
Page Width:200
Page Height:144
ID: CMAP Size: 48
Color[0]: {0, 0, 0}
Color[1]: {0, 128, 128}
Color[2]: {0, 255, 255}
Color[3]: {255, 0, 255}
Color[4]: {0, 255, 0}
Color[5]: {255, 0, 0}
Color[6]: {0, 0, 255}
Color[7]: {255, 255, 0}
Color[8]: {128, 128, 128}
Color[9]: {192, 192, 192}
Color[10]: {128, 0, 0}
Color[11]: {0, 128, 0}
Color[12]: {128, 128, 0}
Color[13]: {0, 0, 128}
Color[14]: {128, 0, 128}
Color[15]: {255, 255, 255}
ID: CAMG Size: 4
View Modes: 0x800
ID: BODY Size: 2448
Number of bytes: 2448
Successfully parsed = true
Figure 10. Output of the ILBM Parser Applied to the File BLK.IFF
6.3 Improvements to the Lexical Analysis of Data Types
The ANTLR Lexer returns tokens made up of characters. For the simple sentence “This is a
simple string.” could return ten tokens, one token for each word, one token for each space and
one token for the period. The number of tokens the lexer returned would depend on the grammar
it was built from. The previous version of the ilbm grammar also created a lexer that groups
bytes intermixed with certain strings. Examples of some of the strings that were returned were
“FORM”, “ILBM”, “CAMG”, “BODY”. In some cases this worked fine. But in other cases it
caused problems. Whenever the lexer encountered a ‘C’ it expected the next byte to be an ‘M’,
‘A’, ‘R’, or another ‘C’. The lexer returned an error if the ASCII equivalent of the next byte was
19
not what it expected. This error would state what it was expecting and what it encountered
instead. The problem occurred because the same byte that can represent the ASCII ‘C’ could also
occur as part of a set of bytes that were supposed to represent a number. Another problem that
occurred with the previous grammar, was that the specification for this file type allowed for
unknown chunks as long as they had a chunk id that consisted of four capital letters followed by
the size of the chunk and the data in the chunk.
The lexer created from a chunk-based binary array attribute grammar, needs to simply return an
array of bytes. It needs to leave it up to the parser as to whether a group of bytes represents an
ASCII string or a number. Figure 11 shows a grammar that ANTLR uses to create a Lexer that
returns an array of bytes.
lexer grammar binaryArray;
//Terminal rules used to create the Lexer
BYTE
: . // The period is used to match any byte
;
Figure 11. Lexer for all Context-free Binary Array Grammars
Applications store data to a file using groups of bytes. Some applications store the most
significant byte of numeric data first. This is called big-endian representation. Other applications
store the least significant byte first. This is called little-endian representation. The applications
that store data in ILBM files use little-endian representation. To make use of this stored numeric
data, the data types that the bytes represent have to be translated into the appropriate numeric
representation.
A set of JAVA utilities have been created that can be referenced by binary attribute grammar
rules to return values of these big and little-endian data types. There are also data types that
return a string of characters and one that returns the hex representation of the bytes as they were
stored in the file. Many chunk-based file specifications allow for the addition of unknown
chunks. These chunks of data may contain data that is used by one application but can be ignored
by other applications. The data types of the data in these chunks are not known, but the size of
the array of data that they contain is known. In this case, if the data contains only ASCII
printable characters, the ASCII representation is returned. In the case that the data contain non-
printable characters, the hex representation of the data is returned. There is a utility for checking
whether a number is odd or even. This utility is needed because odd-sized chunks in chunk-based
files often have a pad byte added. These functions are shown in Figure 12.
20
public static boolean isOdd(long value) {
return ((value % 2) != 0);
}
public static boolean isPrintable(String str) {
if (str == null) {
return false;
}
int sz = str.length();
if (str.charAt(sz - 1) == 0)
sz--;
for (int i = 0; i < sz; i++) {
if (isPrintable(str.charAt(i)) == false) {
return false;
}
}
return true;
}
public static boolean isPrintable(char ch) {
return ((ch >= 32 && ch <= 127) ||
(ch >= 8 && ch <= 10) ||
(ch >= 12 && ch <= 13));
}
public static boolean isValidZString(String str) {
if (str == null) {
return false;
}
int sz = str.length();
return (str.charAt(sz - 1) == 0);
}
public static String uBytesToHex(String string) {
byte[] a = string.getBytes();
StringBuilder sb = new StringBuilder();
sb.append("0x");
for (byte b : a) {
sb.append(String.format("%02x", b & 0xff));
}
return sb.toString();
}
public static long uBytesToLong(String str, String byteOrder) {
if (byteOrder.toUpperCase().equals("LE")) {
str = new StringBuffer(str).reverse().toString();
}
byte[] b = str.getBytes();
return (long) (uByte(b[0]) << 24
| uByte(b[1]) << 16
| uByte(b[2]) << 8
| uByte(b[3]));
21
}
public static long uBytesToDWord(String str, String byteOrder) {
if (byteOrder.toUpperCase().equals("LE")) {
str = new StringBuffer(str).reverse().toString();
}
byte[] b = str.getBytes();
return (long) (uByte(b[0]) << 24
| uByte(b[1]) << 16
| uByte(b[2]) << 8
| uByte(b[3]));
}
public static int uBytesToUWord(String str, String byteOrder) {
if (byteOrder.toUpperCase().equals("LE")) {
str = new StringBuffer(str).reverse().toString();
}
byte[] b = str.getBytes();
return (int) (uByte(b[0]) << 8
| uByte(b[1]));
}
public static int uBytesToWord(String str, String byteOrder) {
if (byteOrder.toUpperCase().equals("LE")) {
str = new StringBuffer(str).reverse().toString();
}
byte[] b = str.getBytes();
return (short) (uByte(b[0]) << 8
| uByte(b[1]));
}
public static int uByte(byte b) {
return ((int) b & 0x00ff);
}
public static int uByte(String str) {
byte[] b = str.getBytes();
return ((int) b[0] & 0x00ff);
}
Figure 12. Data Translation and Other Gammar Utilities
ANTLR was designed to create parsers for textual languages. It has one data type. This is the
text data type that is returned by tokens. There are special grammar rules that use some of the
utilities in Figure 12 to allow binary attribute grammars to read a sequence of bytes and return a
data type other than text. These rules are shown in Figure 13. Together with the utilities in Figure
12, these grammar rules act as a work around that simulates a lexer that has bytes as input and
returns data types other than text.
22
Each of the rules for a numeric data type take a parameter byteOrder. This parameter is then
passed on to the translation utilities. This allows the bytes of numeric data to be correctly
translated. The byteOrder is either big-endian or little-endian depending on how the data was
originally stored in the file. For example, ilbm files are stored in big-endian order while wave
files are stored in little-endian order.
As long as chunk-based binary array grammars are built on top of the simplified lexer grammar
and the data type grammar, ANTLR 4 can be used to automatically create a lexer and parser that
can be used with chunk-based binary array grammars.
grammar dataType;
// The binaryArray grammar is a simple lexer grammar with BYTE as the only
// Token type.
import binaryArray;
// The following rules are a work around.
// data types instead of Antlr4's text only data type.
// They allow other binary grammars read and return
// data types instead of Antlr4's text only Token type.
// These parser and lexer rules simulate a lexer that
// inputs bytes and returns data type Token on demand.
// This rule inputs byte array of a given size. It uses a
// long as the size of the array to input and returns
// a string of all printable charactors of size or the
// hex representation of the array as a string that is
// length of size. dtData [long size] returns [String value]
: byteArray[$size]
{ if(Utilities.isPrintable($text))
$value = $text;
else
$value = Utilities.uBytesToHex($text);
}
;
dtString [long size] returns [String value]
: byteArray[$size]
{$value = $text;}
;
dtZString [long size] returns [String value]
: byteArray[$size - 1]
23
{$value = $text;}
dtUByte
{Utilities.isValidZString($text)}?
;
dtHexString [long size] returns [String value]
: byteArray[$size]
{$value = Utilities.uBytesToHex($text);}
;
dtLong [String byteOrder] returns [long value]
: byteArray[4]
{$value = Utilities.uBytesToLong($text, $byteOrder);}
;
dtDWord [String byteOrder] returns [long value]
: byteArray[4]
{$value = Utilities.uBytesToDWord($text, $byteOrder);}
;
dtUWord [String byteOrder] returns [int value]
: byteArray[2]
{$value = Utilities.uBytesToUWord($text, $byteOrder);}
;
dtWord [String byteOrder] returns [int value]
: byteArray[2]
{$value = Utilities.uBytesToWord($text, $byteOrder);}
;
dtUbyte returns [int value]
: byteArray[1]
{$value = Utilities.uByte($text);}
;
byteArray [long size]
locals [long index = 0]
: ({$index < $size}?
BYTE
{$index += 1;})+
;
Figure 13. Datatype rules used by all binary array attribute grammars
24
The single lexer rule and all the data type rules are a data type grammar. This makes it possible
to simply import the data type grammar into the specific grammar for each chunk-based file
format. A grammar for a file format needs rules that define the layout (structure) of the file
format plus the data type (lexical) rules to handle the data type definition and the translation of
numeric data into a usable form. A rewritten version of the ilbm grammar is shown in Figure 14.
It imports the data type rules.
grammar ilbm;
import dataType;
// Nonterminal rules used to create the Parser
ilbm
locals [String byteOrder = "BE", boolean foundBMHD]
: formID = ckID
{$formID.text.equals("FORM")}? // check that first ckID == FORM
ckSize
ilbmID = ckID
{$ilbmID.text.equals("ILBM")}? // check that second ckID == ILBM
propertiesLoop
body
;
propertiesLoop
locals [String ckName]
: chunkName1 = ckID
{$ckName = $chunkName1.text;}
(
{!$ckName.equals("BODY")}? // check for the end of the loop.
property[$ckName]
chunkName2 = ckID {$ckName = $chunkName2.text;}
)*
;
catch [FailedPredicateException fpe]
{
// Do nothing; ckName == BODY
// This exception is used as an exit for the propertyLoop
// It only occurrs when the next $ckName is equal to "BODY".
// The BODY is where the actual picture is stored.
}
25
property[String currentID]
: {$currentID.equals("BMHD")}? bmhd
| {$currentID.equals("CMAP")}? cmap
| {$currentID.equals("CAMG")}? camg
| {$currentID.equals("CRNG")}? crng
| additionalProperty[currentID]
;
bmhd
: ckSize {$ckSize.value == 20}?
bitMapHeader
{$ilbm::foundBMHD = true;}
;
bitMapHeader
: width = dtUWord[$ilbm::byteOrder]
height = dtUWord[$ilbm::byteOrder]
xPosition = dtWord[$ilbm::byteOrder]
yPosition = dtWord [$ilbm::byteOrder]
nPanes = dtUByte
masking = dtUByte
compression = dtUByte
reserved1 = dtUByte
transparentColor = dtUWord[$ilbm::byteOrder]
xAspect = dtUByte
yAspect = dtUByte
pageWidth = dtWord[$ilbm::byteOrder]
pageHeight = dtWord[$ilbm::byteOrder]
;
cmap
locals [int count, int index]
@init {
$count = 1;
$index = 0;
}
: ckSize {($ckSize.value % 3) == 0}?
({$count <= $ckSize.value}? colorRegister[$index++] {$count += 3;})+
;
26
colorRegister [int index]
: red = dtUByte green = dtUByte blue = dtUByte
;
camg
: ckSize dtLong[$ilbm::byteOrder]
{Long.toHexString($dtLong.value);}
;
crng
: ckSize
cRange
;
cRange
: pad1 = dtWord[$ilbm::byteOrder] {$pad1.value == 0}?
rate = dtWord[$ilbm::byteOrder]
active = dtWord[$ilbm::byteOrder]
low = dtUByte
high = dtUByte
;
additionalProperty [String currentID]
: ckSize
dtData[$ckSize.value]
;
body
: {$ilbm::foundBMHD}?
ckSize dtHexString[$ckSize.value]
;
ckSize returns [long value]
: dtLong[$ilbm::byteOrder]
{$value = $dtLong.value;}
;
ckID
: dtString[4] {$text.equals($text.toUpperCase())}?
;
Figure 14. The ILBM File Format Grammar
27
A wave_form grammar is shown in Figure 15. It also imports the datatype rules.
grammar wave_form; // Define a grammar called wave_form
import dataType;
// Nonterminal rules used to create the Parser
wave_form
locals [String byteOrder = "LE", boolean foundFmtCk]
: riff = dtString[4] {$riff.value.equals("RIFF")}?
dtDWord[$byteOrder]
wave = dtString[4] {$wave.value.equals("WAVE")}?
chunkOrListLoop
;
chunkOrListLoop
: (ckID = dtString[4]
chunkOrList[$ckID.value])+
;
chunkOrList[String ckID]
: {$ckID.equals("fmt ")}? fmt_ck {$wave_form::foundFmtCk = true;}
| {$ckID.equals("fact")}? fact_ck
| {$ckID.equals("data") && $wave_form::foundFmtCk}? data_ck
| {$ckID.equals("cue ")}? cue_ck
| {$ckID.equals("plst")}? playlist_ck
| {$ckID.equals("LIST")}?
(ckSize = dtDWord[$wave_form::byteOrder]
typeList = dtString[4]
listType[$ckSize.value, $typeList.value])
| unknownChunk
;
fmt_ck
: ckSize = dtDWord[$wave_form::byteOrder]
common_fields
format_specific_fields[$common_fields.formatTag,
$ckSize.value - 14]
({Utilities.isOdd($ckSize.value)}? pad_byte)?
;
common_fields returns [int formatTag]
28
: // Format category
wFormatTag = dtUWord[$wave_form::byteOrder]
{$formatTag = $wFormatTag.value;}
// Number of channels
wChannels = dtUWord[$wave_form::byteOrder]
// Sampling rate
dwSamplesPerSec = dtDWord[$wave_form::byteOrder]
// For buffer estimation
dwAvgBytesPerSec = dtDWord[$wave_form::byteOrder]
// Data block size
wBlockAlign = dtUWord[$wave_form::byteOrder]
;
/** <format-specific-fields> consist of zero or more bytes of parameters.
* Which parameters occur depends on the WAVE format category. Validation
* should allow for (and ignore) and unknown parameters. */
format_specific_fields [int formatTag, long sizeRest]
returns[int bitsPerSample]
: ({$sizeRest > 0}?
({$formatTag == 1}? (pcm_specific_fields[$sizeRest]
{$bitsPerSample = $pcm_specific_fields.bitsPerSample;})
| dtHexString[$sizeRest])
|)
;
pcm_specific_fields [long sizeRest] returns[int bitsPerSample]
: wBitsPerSample = dtUWord[$wave_form::byteOrder] // Sample rate
{$bitsPerSample = $wBitsPerSample.value;}
({($sizeRest - 2) > 0}? dtHexString[$sizeRest - 2])?
;
fact_ck
: ckSize = dtDWord[$wave_form::byteOrder]
dwFileSize = dtDWord[$wave_form::byteOrder]
;
data_ck
: ckSize = dtDWord[$wave_form::byteOrder]
dtHexString[$ckSize.value]
({Utilities.isOdd($ckSize.value)}? pad_byte)?
29
;
cue_ck
: dtDWord[$wave_form::byteOrder]
dwCuePoints = dtDWord[$wave_form::byteOrder]
cue_point_array[$dwCuePoints.value]
;
cue_point_array [long size]
locals [long index]
: ({$index < $size}?
cue_point
{$index += 1;})+
;
cue_point
: dwName = dtDWord[$wave_form::byteOrder]
dwPosition = dtDWord[$wave_form::byteOrder]
fccChunk = dtString[4]
dwChunkStart = dtDWord[$wave_form::byteOrder]
dwBlockStart = dtDWord[$wave_form::byteOrder]
dwSampleOffset = dtDWord[$wave_form::byteOrder]
;
playlist_ck
: ckSize = dtDWord[$wave_form::byteOrder]
dwSegments = dtDWord[$wave_form::byteOrder]
play_segment_array[$dwSegments.value]
;
play_segment_array [long size]
locals [long index]
: ({$index < $size}?
play_segment
{$index += 1;})+
;
play_segment
: dwName = dtDWord[$wave_form::byteOrder]
dwLength = dtDWord[$wave_form::byteOrder]
dwLoops = dtDWord[$wave_form::byteOrder]
;
30
listType[long size, String typeList]
: {$typeList.equals("adtl")}? assoc_data_list[$size]
| {$typeList.equals("wavl") && $wave_form::foundFmtCk}?
wave_list[$size]
| {$typeList.equals("INFO")}? info_list[$size]
| unknownChunkLoop[size]
;
wave_list[long size]
locals[long index = 4]
: ({$index < $size}?
waveListType = dtString[4] dataOrSlnt[$waveListType.value]
{$index += 4 + $dataOrSlnt.text.length();})+
;
dataOrSlnt[String dataType]
: {$dataType.equals("data")}? data_ck
| {$dataType.equals("slnt")}? slnt_ck
;
slnt_ck
: ckSize = dtDWord[$wave_form::byteOrder]
dtHexString[$ckSize.value]
({Utilities.isOdd($ckSize.value)}? pad_byte)?
;
assoc_data_list[long size]
locals[long index = 4]
: ({$index < $size}?
assoc_dataType = dtString[4] assoc_data_type[$assoc_dataType.value]
{$index += 4 + $assoc_data_type.text.length();})+
;
assoc_data_type[String assoc_dataType]
: {$assoc_dataType.equals("labl")}? labl_ck
| {$assoc_dataType.equals("note")}? note_ck
| {$assoc_dataType.equals("ltxt")}? ltxt_ck
| {$assoc_dataType.equals("file")}? file_ck
;
labl_ck
: ckSize = dtDWord[$wave_form::byteOrder]
dwName = dtDWord[$wave_form::byteOrder]
31
labal = dtZString[$ckSize.value - 4]
({Utilities.isOdd($ckSize.value)}? pad_byte)?
;
note_ck
: ckSize = dtDWord[$wave_form::byteOrder]
dwName = dtDWord[$wave_form::byteOrder]
comment = dtZString[$ckSize.value - 4]
({Utilities.isOdd($ckSize.value)}? pad_byte)?
;
ltxt_ck
: ckSize = dtDWord[$wave_form::byteOrder]
dwName = dtDWord[$wave_form::byteOrder]
dwSampleLength = dtDWord[$wave_form::byteOrder]
dwPurpose = dtString[4]
wCountry = dtUWord[$wave_form::byteOrder]
wLanguage = dtUWord[$wave_form::byteOrder]
wDialect = dtUWord[$wave_form::byteOrder]
wCodePage = dtUWord[$wave_form::byteOrder]
({$ckSize.value - 20 > 0}? dtData[$ckSize.value - 20])?
({Utilities.isOdd($ckSize.value)}? pad_byte)?
;
file_ck
: ckSize = dtDWord[$wave_form::byteOrder]
dwName = dtDWord[$wave_form::byteOrder]
dwMedType = dtDWord[$wave_form::byteOrder]
({$ckSize.value - 8 > 0}? dtData[$ckSize.value - 8])?
({Utilities.isOdd($ckSize.value)}? pad_byte)?
;
info_list [long size]
locals [long index = 4]
: ({$index < $size}?
typeInfo = dtString[4]
info_type[$typeInfo.value]
{$index += $info_type.text.length() + 4;})+
;
info_type[String typeInfo]
32
: {$typeInfo.equals("ISFT")}? isft
| {$typeInfo.equals("IENG")}? ieng
| {$typeInfo.equals("ICRD")}? icrd
| unknownChunk
;
isft
: strSize = dtDWord[$wave_form::byteOrder]
software = dtZString[$strSize.value]
({Utilities.isOdd($strSize.value)}? pad_byte)?
;
ieng
: strSize = dtDWord[$wave_form::byteOrder]
engineer = dtZString[$strSize.value]
({Utilities.isOdd($strSize.value)}? pad_byte)?
;
icrd
: strSize = dtDWord[$wave_form::byteOrder]
creationDate = dtZString[$strSize.value]
({Utilities.isOdd($strSize.value)}? pad_byte)?
;
unknownChunk
: dtDWord[$wave_form::byteOrder]
dtData[$dtDWord.value]
({Utilities.isOdd($dtDWord.value)}? pad_byte)?
;
pad_byte
: dtUByte
;
unknownChunkLoop [long size]
locals [long index = 4]
: ({$index < $size}?
dtString[4]
unknownChunk
{$index += $unknownChunk.text.length() + 4;})+
;
Figure 15. The Wave_Form File Format Grammar
33
6.4 Generating Parse Trees for Binary Array Grammars
There are two ways to generate parse trees using ANTLR. One way is to create a tree grammar.
The other way is to embed print statements as actions in the grammar itself. Both methods are
awkward and make it difficult to both read and specify grammars. With the advent of ANTLR4
[Parr 2012], this has changed. ANTLR4 has a built in parse tree walker. It automatically creates
nodes for each rule in the grammar and a terminal node for each token.
6.4.1 How ANTLR4 Generates Parse Trees for String Attribute Grammars
ANTLR4 provides a command line parameter that allows the user to specify whether they would
like the parse tree displayed as a nested list style tree or as a graphical parse tree in a graphical
window. Figure 16 shows a parse tree displayed as a nested list.
(prog (stat (expr 193) \r\n) (stat a = (expr 5) \r\n) (stat b = (expr 6) \r\n) (stat (expr (expr a) +
(expr (expr b) * (expr 2))) \r\n) (stat (expr (expr ( (expr (expr 1) + (expr 2)) )) * (expr 3)) \r\n))
Figure 16. Parse tree shown as a nested list.
Figure 17 shows the parse tree output in a graphical user interface window.
Figure 17. Parse tree shown as a graph.
34
6.4.2 Using ANTLR4 to Generate Parse Trees for Binary Attribute Grammars
There are two problems with using the ANTLR4 function for displaying the parse tree as a
nested list. Since the parse tree is displayed as one continuous string, the parse tree becomes
almost unreadable when the input file is longer than a couple of lines. If one uses the ANTLR4
parse tree function on an attribute grammar that has been extended to enable binary data types,
the output becomes even more unreadable. In string grammars, a token is a printable string. In a
binary array grammar, the byte is not always a printable character. When a printable character
for a byte is unavailable, a space is used by ANTLR4. Also, in the case of binary grammars, data
types are represented as byte arrays of different lengths. Even a string is read as an array of byte
tokens where each token value is separated by a space. The data type long integer is an array of 4
bytes. The decimal number that the bytes of data represent is what is important, not the
individual bytes.
Figure 18 shows a display of the ILBM formated file Docking.bru.
Figure 18. ILBM File Docking.bru
Figure 19 shows a nested list style parse tree created from a parse of the file using the ILBM
ANTLR grammar shown in Figure 14. This single string requires 3 pages to printout and is
almost unreadable. Most of the terminal bytes are represented as spaces or odd characters. The
actual size of the image is shown to be 6594 bytes. However, to conserve space in this paper, in
Figure 19 only the first three bytes and the last three bytes of the image are shown separated by
an ellipsis.
35
(ilbm (ckID (dtString (byteArray F O R M))) (ckSize (dtLong (byteArray � D)))
(ckID (dtString (byteArray I L B M))) (propertiesLoop (ckID (dtString (byteArray B M H
D))) (property (bmhd (ckSize (dtLong (byteArray �))) (bitMapHeader (dtWord
(byteArray � ¹)) (dtWord (byteArray � H)) (dtWord (byteArray )) (dtWord (byteArray
)) (dtUByye (byteArray �)) (dtUByye (byteArray )) (dtUByye (byteArray �)) (dtUByye
(byteArray €)) (dtWord (byteArray )) (dtUByye (byteArray �)) (dtUByye (byteArray
)) (dtWord (byteArray � ¹)) (dtWord (byteArray � ?))))) (ckID (dtString (byteArray A N
N O))) (property (additionalProperty (ckSize (dtLong (byteArray H))) (dtData
(byteArray $ V E R : W r i t t e n b y A S D G ' s A r t D e p a r t m e n t
P r o f e s s i o n a l I F F 3 . 0 . 4 ( 0 5 . 0 2 . 9 5 ) )))) (ckID (dtString
(byteArray C M A P))) (property (cmap (ckSize (dtLong (byteArray )))
(colorRegister (dtUByye (byteArray ÿ)) (dtUByye (byteArray ÿ)) (dtUByye (byteArray
ÿ))) (colorRegister (dtUByye (byteArray )) (dtUByye (byteArray )) (dtUByye
(byteArray ))) (colorRegister (dtUByye (byteArray ÿ)) (dtUByye (byteArray ))
(dtUByye (byteArray ))) (colorRegister (dtUByye (byteArray ÿ)) (dtUByye (byteArray
f)) (dtUByye (byteArray ))))) (ckID (dtString (byteArray C A M G))) (property (camg
(ckSize (dtLong (byteArray �))) (dtLong (byteArray € �)))) (ckID (dtString
(byteArray D P I ))) (property (additionalProperty (ckSize (dtLong (byteArray
�))) (dtData (byteArray ¥ � ,)))) (ckID (dtString (byteArray B O D Y)))) (body
(ckSize (dtLong (byteArray � ))) (dtHexString (byteArray © © … © ))))
Figure 19. Parse Tree Created by ANTLR4
The parse tree of binary array grammars can be made more readable by (1) pretty printing the
nested lists so that the levels of the tree can be shown, and (2) by outputting the values returned
by the datatype functions.
6.4.3 Extensions to ANTLR4 to Generate Pretty Printed Parse Trees with Data
Type Values
ANTLR4 creates the nested list style parse tree by using a built-in parse tree walker. When
ANTLR4 generates a parser, it creates a RuleContext for every rule. Each RuleContext has a
RuleNode interface. The RuleNode contains a list of children, and an index into both a list of
Rules and a list of RuleNames. A RuleNode child can be either a RuleNode or a TerminalNode.
The TerminalNode is associated with text that matches its TokenType. When an ANTLR4
created parser calls its top rule, it returns the rule’s RuleContext. To get the parse tree string the
user calls the toStringTree method of the returned context. The toStringTree method creates a
string with its ruleName as the head of the list followed by a space then appends resulting strings
from its recursive calls to the toStringTree method of all its children. The recursive calling stops
when it reaches a TerminalNode, which simply returns a string containing its matching text.
A group of utilities, called prettyPrintTree, was created to pretty print a tree. This overloaded
function behaves in a manner similar to the toStringTree method. This method takes a TreeNode
as a parameter and walks the RuleNodes until it finds a RuleNode with “dt” as the start of its
ruleName. As shown in Figure 14, every data type rule starts with “dt”. All the data type rules
call binaryArray to build the data type from the stored data. Every data type rule has a return
36
value called “value”. When the data type rule is reached it returns the string value of its “value”
attribute. Another difference between the toStringTree method and the prettyPrintTree method is
that while ruleName is followed by a space in toStringTree, it is followed by a newline character
and is indented according to the depth of its embedding in prettyPrintTree. The prettyPrintTree
utility methods are shown in Appendix C.
Though the method is called prettyPrintTree, it does not actually print the parse tree. It formats
the parse tree for printing or displaying in a window.
Figure 20 shows the parse tree that is result of calling the prettyPrintTree utility method with the
top-level RuleContext returned from parsing the DOCKING.BRU ilbm file. Only a few of the
6594 bytes in the image are shown in the figure. The pretty printed parse tree is easier to read
and understand than the nested lists shown in Figure 19.
(ilbm
(ckID
(dtString
FORM))
(ckSize
(dtLong 6724))
(ckID
(dtString
ILBM))
(propertiesLoop
(ckID
(dtString
BMHD))
(property
(bmhd
(ckSize
(dtLong 20))
(bitMapHeader
(dtWord 706)
(dtWord 328)
(dtWord 0)
(dtWord 0)
(dtUByye 2)
(dtUByye 0)
(dtUByye 1)
(dtUByye 194)
(dtWord 0)
(dtUByye 20)
(dtUByye 11)
(dtWord 706)
(dtWord 450))))
(ckID
(dtString
ANNO))
(property
(additionalProperty
37
(ckSize
(dtLong 72))
(dtData
$VER: Written by ASDG's Art Department Professional IFF3.0.4 (05.)))
(ckID
(dtString
CMAP))
(property
(cmap
(ckSize
(dtLong 12))
(colorRegister
(dtUByye 195)
(dtUByye 195)
(dtUByye 195))
(colorRegister
(dtUByye 0)
(dtUByye 0)
(dtUByye 0))
(colorRegister
(dtUByye 195)
(dtUByye 0)
(dtUByye 0))
(colorRegister
(dtUByye 195)
(dtUByye 102)
(dtUByye 0))))
(ckID
(dtString
CAMG))
(property
(camg
(ckSize
(dtLong 4))
(dtLong 49792)))
(ckID
(dtString
DPI ))
(property
(additionalProperty
(ckSize
(dtLong 4))
(dtData
0x00c2a5012c)))
(ckID
(dtString
BODY)))
(body
(ckSize
(dtLong 6594))
(dtHexString
0xc2a900c2a900c2a900c2a900c2a900c2a900c2a900c39b000101c3bec3bd00017
c380c39700c2a900c39b0001c3bfc3bec3bd00027fc3bfc280c39800c2a900c39c0
023fc3bfc3bec3bd00027fc3bfc3bec39800c2a900c39d000307c3bfc3bfc3bec3b
38
. . .
00c39c00023fc3bfc3bec3bd00027fc3bfc3bec39800c2a900c39b0001c3bfc3bec
bd00027fc3bfc280c39800c2a900c39b000103c3bec3bd00017fc380c39700c2a90
c2a900c2a900c2a900c2a900c2a900c2a900c2a900c2a900c2a900c2a900c2a9000)))
Figure 20. Pretty Printed Parse Tree of ILBM Formated File Docking.bru
6.4.4 How ANTLR4’s Attribute Grammars, Semantic Predicates and Action
Clauses Help with Parsing and Validation
Along with the ability to pass attribute values to and from rules, ANTLR4 allows rules to create
local variables that can be seen and modified by rules below them in the parse tree. It is best to
pass variables to and from rules using its parameters. But, this is not always efficient. Many file
formats use a single byte order for the entire file. It is inconvenient to pass the byte order from
every rule down to the data type level. By setting a local variable at the top level of a grammar,
all rules in the grammar can access the variable by prefacing the variable name with the top-level
rule’s name followed by two colons. Any rule can access one of its ancestor’s local variables in
this way. As seen in Figures 14 and 15, the ilbm grammar and wave_form grammars each have
a top-level variable defined as a string called byteOrder. In the ilbm grammar byteOrder is set to
“BE” for big endian and in the wave_form grammar the byteOrder is set to “LE” for little endian.
Semantic predicates can be used in two ways. They can be used in guiding the parser or in
validating the file format. The property rule below is an example of semantic predicates being
used to guide the parse.
property[String currentID]
: {$currentID.equals("BMHD")}? bmhd
| {$currentID.equals("CMAP")}? cmap
| {$currentID.equals("CAMG")}? camg
| {$currentID.equals("CRNG")}? crng
| additionalProperty[currentID] ;
The value of the semantic predicate currentID is not available until it is read from the file.
Without this value, there no way to determine which rule should be called next. When the
currentID is equal to the string “BMHD”, there are only two possible following rules—the bmhd
rule or the additionalProperty rule. Because the bmhd path occurs first, it is chosen. If all the
currentID tests fail, the additionalProperty rule would be the only possible rule. In that case, the
value of chunk size is used. If the chunk size evaluates to odd, there is a pad byte at the end of
the chunk.
Another use of semantic predicates is in validation of file formats. The top-level rule of the
wave_form grammar shown below uses semantic predicates to support validation. Semantic
39
predicates used in this way are similar to assert statements in the C programming language. If the
predicates evaluate to false, a failed predicate exception is thrown by the parser and everything
following the predicate is ignored. The local foundFmtCk variable at the top-level is set to true
when the format chunk rule is successful. Later, the foundFmtCk will be used as a semantic
predicate to validate that the format chunk comes before the data chunk or before the LIST
chunk with the list type “wavl”.
wave_form
locals [String byteOrder = "LE", boolean foundFmtCk]
: riff = dtString[4] {$riff.value.equals("RIFF")}?
dtDWord[$byteOrder]
wave = dtString[4] {$wave.value.equals("WAVE")}?
chunkOrListLoop ;
ANTLR4’s action clauses can also be used to guide the parser or to aid in validating the file
format. In action clauses, mathematical computations can be performed and variables can be
assigned values. An if statement inside an action clause can be used to throw an exception or to
send a message to the standard output or to the error output. The bmhd rule below shows the
top-level variable ilbm foundBMHD being set to true after successfully parsing the bitmap
header chunk. The body rule shows the foundBMHD value being tested for the purpose of
validating the file format.
bmhd
: ckSize {$ckSize.value == 20}?
bitMapHeader
{$ilbm::foundBMHD = true;} ;
body
: {$ilbm::foundBMHD}?
ckSize dtHexString[$ckSize.value] ;
Shown below is the unknownChunkLoop rule that is found in the waveform grammar. It is
called when a list is encountered that has an unknown list type. There is no way to know how
many chunks it contains or what the chunks are. The action clause increments the index so the
loop can be terminated when the size of the accumulated chunks equal the size of the list chunk.
unknownChunkLoop [long size]
locals [long index = 4]
: ({$index < $size}?
dtString[4]
unknownChunk
{$index += $unknownChunk.text.length() + 4;})+ ;
40
7 Related Research
Researchers have developed a number of data description languages to support validation of file
formats. EAST (Enhanced ADA Subset) is a data description language developed by the
Consultative Committee for Space Data Systems [2010]. The Data Entity Dictionary
Specification Language (DEDSL) can be used in conjunction with EAST for defining semantic
information. The EAST description is used to interpret and provide access to information in
binary and text files.
ASN.1 (Abstract Syntax Notation One) is an international standard which aims a specifying the
structure of telecommunication and computer networking protocols between heterogeneous
systems [Larmouth, 1999; Dubuission 2000; Walkin 2006].
DATASCRIPT [Back 2002] supports specifying and parsing binary data and has been used to
manipulate Java jar files and ELF object files. The PADS/C data description language can be
used to define the file formats of text and binary files [Fisher & Gruber 2005]. The PADS/C
compiler compiles the description into tools that can be used to recognize, manipulate and
transform the data into other formats.
The Data Format Description Language (DFDL) is being developed by a Working Group of the
Open Grid Forum [OGF 2011]. The data description language is based on a subset of W3C XML
Schema using <xs:appinfo> annotations to carry the extra information necessary to describe non-
XML physical representations. Version 1 of the language specification was published in 2011
and a parser is being implemented.
Each of the data description languages described above can be used to define data types and file
structures of binary files. However, they are based on file structure declarations such as those of
C or Java. The binary file format grammar described in this paper most closely resembles DFDL.
However, the binary file grammar described in this paper is the only data description language
based on formal grammars that is used for creating parsers for file formats.
JHOVE (JSTOR/Harvard Object Validation Environment) is an API written in Java for the
validation of the formats of digital files [Harvard University Library 2011]. JHOVE supports
validation of the following formats: AIFF, GIF, HTML, JPEG, JPEG 2000, PDF, TIFF, WAVE,
XML and ASCII and UTF-8 encoded text. JHOVE2 is a rewrite of JHOVE. It currently contains
validation modules for the following formats: ICC Color Profile, SGML, ESRI Shapefile, TIFF,
WAVE, XML and UTF-8 encoded text [California Digital Library 2011].
The research reported in this paper is similar in intent to that of the JHOVE projects—validation
of binary file formats. As a matter of fact, binary array grammars and parsers are being
41
constructed for the chunk-based file formats AIFF, JPEG, JPEG 2000, WAVE and ESRI
Shapefile. However, the research reported herein differs from that of the JHOVE projects in that
what is sought is a technology for generating validators for binary file formats from grammars
specifying the binary file format.
8. Conclusion
The research question addressed by his research is whether it is possible to extend the context-
free grammars used to specify the syntax of programming languages to the specification of
binary file formats and to use these grammars with parsers for validating the file formats of
binary files. A major family of binary file formats, chunk-based, was described. Then extensions
to the concepts of context-free grammars and attribute grammars were described that enable the
specification of binary file formats. Examples of attribute array grammars for two chunk-based
binary file formats were then presented. Recursive descent parsers for this class of grammars was
then described. Experience in using ANTLR, a parser generator for LL(k) string grammars, in
generating parsers for recognizing the formats of binary files was discussed. Finally, related
research in data description languages for binary file formats was described and the JHOVE2
project which is creating Java-based tools for validating file formats was described.
It is concluded that it is possible to extend context-free grammars to the specification of chunk-
based binary file formats. Furthermore, these grammars can be used with recursive descent
parsers for parsing the file formats of chunk-based binary files. It remains to be determined
whether these attribute grammars based on binary array grammars are adequate to define other
families of binary file formats.
A capability to validate binary file formats assumes that the file format of a file is already
known. In this research, we use a file type identifier based on the UNIX file command and a
magic file that we have created [Underwood 2009]. We have samples of the chunk-based file
formats discussed in this report including those referenced in Appendices A and B. We also have
file signature tests (magic tests) for all these file types.
We would like to have a parser generator (also called a compiler–compiler) that would take as
input an attribute grammar for a binary file format and generate a parser or interpreter that
recognizes the file format or that interprets the file format and displays or plays the contents. The
closest parser generator that we have found to having these capabilities is ANTLR and ANTLR4
(www.antlr.org/ ). ANTLR will generate a top-down recursive descent parser from a context-free
attribute grammar. However, it works with strings, and generates a separate lexical analyzer.
42
Appendices A and B show the names of 86 file formats that belong to the family of “chunk-
based” file formats. We have specifications for most of these file formats and samples of files
with these formats.
These file formats have a similar structure. We have developed binary array grammars for two of
these binary file formats. We should be able to do so for all of them. We used these grammars
with the extensions to ANTLR4 to generate top-down recursive descent parsers for the file
formats defined with these grammars.
In this paper it is shown how to extend ANTLR4 to generate parsers for binary array attribute
grammars by creating lexical rules and functions for binary data types. It is also shown how to
generate parse trees represented as nested lists that can be pretty printed.
In this paper, it has not yet been demonstrated how to create a parser that validates binary file
formats with error messages for noncompliant syntax or semantics. Validation of binary file
formats via binary array attribute grammars will be demonstrated in a forthcoming report on
binary array attribute grammars for directory-based binary file formats.
The significance of success in this research task is that if binary file formats can be specified
with binary array attribute grammars, then only one parsing generator is needed to generate the
parsers for verifying that a file conforms to a particular format. Similarly, the same parsing
generator can possibly be used for conversion of legacy file formats to current or standard
formats. Finally, the same parser generator can possibly be used for generating viewers/players
for most file formats. This would increase the likelihood of preserving and making available into
the indefinite future those digital records encoded in binary file formats.
43
References
[3GPP2 2003] 3GPP2 C.S0050, 3GPP2 File Formats for Multimedia Services, File Format for
Multimedia Services for cdma2000. www.3gpp2.org/Public_html/specs/index.cfm
[Aho 1968] A. V. Aho. Indexed Grammars – An Extension of Context-free Grammars. Journal
of the ACM, Vol. 15, Issue 4, Oct. 1968.
[Alias-wavefront 1999] Alias-Wavefront. Maya File Formats, Maya 2.0, 1999.
http://caad.arch.ethz.ch/info/maya/manual/
[Amigan 2011] Amigan Software. IFF FORM and Chunk Registry
http://amigan.1emu.net/reg/iff.html
[Anisimovsky 2002a] Valery V. Anisimovsky. Electronic Arts Audio File Formats Description,
May 1, 2002. www.extractor.ru/articles/electronic_arts_audio_file_formats_description
[Anisimovsky 2002b] Valery V. Anisimovsky. Electronic Arts New Audio File Formats
Description, May 1 2002. www.wotsit.org/list.asp?search=electronic+arts&button=GO%21
[Apple 1989] Audio Interchange File Format: "AIFF" A Standard for Sampled Sound Files
Version 1.3 Apple Computer, Inc., January 1989.
www.aesnashville.org/PDFs/AIFF%20Specs.pdf
[Apple 1991] AIFF-C Apple Computer, Inc. Draft 08/26/91
http://www-mmsp.ece.mcgill.ca/Documents/AudioFormats/AIFF/Docs/AIFF-C.9.26.91.pdf
[Apple 1996] Apple Computer Inc. QuickTime File Format Specification, May 1996.
www.idea2ic.com/File_Formats/QuickTime%20File%20Format.pdf
[Apple 2001] Apple Computer, Inc. QuickTime File Format, 2001-03-01.
https://developer.apple.com/standards/classicquicktime.html
[Apple 2005] Apple. Core Audio Format Specification, 2005
http://developer.apple.com/library/mac/#documentation/MusicAudio/Reference/CAFSpec/CAF_
spec/CAF_spec.html
[Apple 2007] Apple Inc. QuickTime File Format Specification, 2007.
http://developer.apple.com/library/mac/#documentation/QuickTime/QTFF/QTFFPreface/qtffPref
ace.html
[Autodesk 1997] Autodesk Ltd. 3D Studio File Format (3ds). Document Revision 0.93 - January
1997 www.dcs.ed.ac.uk/home/mxr/gfx/3d/3DS.spec
44
[Autodesk 2009] Autodesk. DXF Reference, January 2009.
http://usa.autodesk.com/adsk/servlet/item?linkID=10809853&id=12272454&siteID=123112
[Back 2002] G. Back. (2002) A specification and scripting language for binary data. Generative
Programming and Component Engineering, Vol. 2487, Lecture Notes in Computer Science, 66-
77.
[Barth 1979] G. Barth. Fast recognition of context-sensitive structures. Journal of Computing, V
22, 1979, pp. 243-256.
[Baum & Noel 2002]D. Baum & G. Noel. The Simms Interchange File Format, 2002
http://simtech.sourceforge.net/tech/iff.html
[Bayless 1987] James Bayless. WORD (word processing FORM used by ProWrite) New
Horizons Software, Inc. 1987. http://amigan.1emu.net/reg/WORD.txt
[Beham 1995] H. Beham. DigiTrekker Module Format (*.DTM) rev1.0, 1995.
ftp://ftp.modland.com/pub/documents/format_documentation/DigiTrekker%20(.dtm).txt
[Blank Aspect 2010a] blankaspect.org. NLF (Nested List File) format, December 2010.
http://onda.sourceforge.net/nlf/nlfFormat.html
[Blank Aspect 2010b] blankaspect.org. Onda lossless audio compression algorithm & Onda file
formats, March 2010. http://onda.sourceforge.net/ondaAlgorithmAndFileFormats.html
[California Digital Library 2011] California Digital Library. JHOVE2 User’s Guide. 2011
https://bytebucket.org/jhove2/main/wiki/documents/JHOVE2-Users-Guide_20110222.pdf
[Caligari 1998] Caligri. Caligari trueSpace file format, Document Version 6, July
2002. http://www.caligari.com/download/zip/scncob65.zip
[CCITT 1992] CCITT. Terminal Equipment And Protocols for Telematic Services Information
Technology Digital Compression and Coding of Continuous-Tone Still Images Requirements
and Guidelines Recommendation T.81, September 1992. http://www.w3.org/Graphics/JPEG/itu-
t81.pdf
[CCSDS 2010] Consultative Committee for Space Data Systems. (2010) The Data Description
Language EAST Specification. http://public.ccsds.org/publications/archive/644x0b3.pdf
Compuphase. The FLIC File Format. www.compuphase.com/flic.htm
[Corel 1998] Corel Corporation, CMX File Format Ver. 8.0, 31 March 1998.
45
[Crazy Talk] Crazy Talk. Animator Pro file Formats. www.martinreddy.net/gfx/anim/FLC.txt
[Cunniff and Orr] Ross Cunniff and John Orr. 2D Standard Object Format: FORM DR2D.
Taliesin, Inc. http://amigan.1emu.net/reg/DR2D.txt
[Cuturicu and Fromme 1999] Cuturicu, Cristi and Oliver Fromme. .CRYX's note about the JPEG
decoding algorithm., 1999. www.coco3.com/text/doc_JPEG.txt
[Dubuission 2000] O. Dubuission. ASN.1 Communication between heterogeneous systems.
2000. www.oss.com/asn1/resources/books-whitepapers-pubs/dubuisson-asn1-book.PDF
[Dunckley et al 2007] Dunckley, M., Rankin, S., Conway, E. & Giaretta, D. (2007) The use of
file description languages for file format identification and validation. Proc. PV 2007
Conference: Ensuring the Long-Term Preservation and Value Adding to Scientific and Technical
Data, DLR, Oberpfaffenhofen/Munich, Germany. http://epubs.cclrc.ac.uk/work-
details?w=50089
[EBU 2001] European Broadcasting Union. Specification of the EBU Broadcast Wave Format,
Tech 3285, July 2001. http://tech.ebu.ch/docs/tech/tech3285.pdf
[Ecma 2009] Ecma International. JPEG File Interchange Format (JFIF). ECMA TR/98 1st ed.,
Ecma International, 2009. www.ecma-international.org/publications/techreports/E-TR-098.htm
[ESRI 1998] ESRI Shapefile Technical Description. July 1998.
www.esri.com/library/whitepapers/pdfs/shapefile.pdf
[ETSI 2012] ETSI TS 26.244; Transparent end-to-end packet switched streaming service (PSS);
3GPP file format (3GP) 2012 www.3gpp.org/ftp/Specs/html-info/26244.htm
Frans Faase. BFF: A Grammar for Binary File Formats. www.iwriteiam.nl/Ha_BFF.html
[Fercoq 1996] Robin Fercoq, 3D-Studio Material-Library File Format (.mli), Autodesk Ltd.,
May 1996. www.mediatel.lu/workshop/graphic/3D_fileformat/h_3ds2.html
[Ferguson 1995] Stuart Ferguson. Lightwave 3d Layered Object File Format. Lightwave, 10/4/95
www.mediatel.lu/workshop/graphic/3D_fileformat/lightwave/h_lwlo.html
[Fisher & Gruber 2005] Fisher, K. & Gruber, R. (2005). PADS: A domain specific language for
processing ad hoc data. ACM Conference on Programming language Design and
Implementation. ACM Press. http://www.padsproj.org/papers/pldi.pdf
[Foo 1995] Kenneth Foo, Audio Manager 4 Module Format. TechnoMaestro/RDG, 1995
46
[Frost 1997a] Martin Frost. Z-machine Common Save-File Format Standard also called Quetzal:
Quetzal Unifies Efficiently The Z-Machine Archive Language version 1.3b 29th September
1997. www.gnelson.demon.co.uk/zspec/quetzal.html
[Frost 1997b] Martin Frost. Z-machine Common Save-File Format Standard. also called
Quetzal: Quetzal Unifies Efficiently The Z-Machine Archive Language version 1.4 (03-Nov-97)
http://ifarchive.heanet.ie/if-archive/infocom/interpreters/specification/savefile_14.txt
[Glanville 2003] R. S. Glanville. Format of Anim8or .an8 Files - V0.85, 21-Sep-03.
http://www.anim8or.com/resources/an8_format.txt
[Google 2010] Google. WebP RIFF Container, 2010.
http://code.google.com/speed/webp/docs/riff_container.html
[Haaland] Mike Haaland. Flic Files (.FLI) Format description.
http://www.martinreddy.net/gfx/anim/FLI.txt
[Hamilton 1992] Eric Hamilton. JPEG File Interchange Format: Version 1.02., C-Cube
Microsystems; Milpitas, CA; September 1, 1992. www.jpeg.org/public/jfif.pdf
[Harvard University Library 2011] JHOVE Documentation.
http://jhove.sourceforge.net/documentation.html
[Hayes and Morrison 1985] Steve Hayes and Jerry Morrison, "8SVX" IFF 8-Bit Sampled Voice.
Electronic Arts, February 7, 1985. http://amigan.1emu.net/reg/8SVX.txt
[Hopcroft et al 1995] J.E. Hopcroft, R. Motwani, and J.D. Ullman. Introduction to Automata
Theory, Languages and Computation, Second Edition. Addison Wesley, ISBN: 0-201-44124-1,
1995.
[Houghtaling 1996] R. James Houghtaling. ANI (Windows95 Animated Cursor File Format),
August 1996. www.wotsit.org/list.asp?search=ani&button=GO%21
[IBM & Microsoft 1991] IBM and Microsoft. Multimedia Programming Interface and Data
Specifications 1.0, August 1991. www.kk.iij4u.or.jp/~kondo/wave/mpidata.txt
www.tactilemedia.com/info/MCI_Control_Info.html
http://www-mmsp.ece.mcgill.ca/documents/audioformats/wave/Docs/riffmci.pdf
[Int. MIDI Assoc 2008] The International MIDI Association. Standard MIDI-File Format Spec.
1.1, 2008 http://duskblue.org/proj/toymidi/midiformat.pdf
[ISO/IEC JTC 1/SC 29/WG 1] JPEG 2000 Part I : Core Coding System, Final Committee Draft
Version 1.0, March, 2000. http://www.jpeg.org/public/fcd15444-1.pdf
47
[ISO/IEC 10918-1:1994] Information technology -- Digital compression and coding of
continuous-tone still images-Requirements and Guidelines.
www.iso.org/iso/iso_catalogue/catalogue_tc/catalogue_detail.htm?csnumber=18902
[ISO/IEC IS 10918-3] Digital Compression and Coding of Continuous-Tone Still Images -
Extensions. Also ITU-T T.84 Annex F www.jpeg.org/public/spiff.pdf
[ISO/IEC 13818-7:2006] Information technology -- Generic coding of moving pictures and
associated audio information -- Part 7: Advanced Audio Coding (AAC)
[ISO/IEC 14495-1:1999] Information technology -- Lossless and near-lossless compression of
continuous-tone still images: Baseline, December 1, 1999.
[ISO/IEC 14495-2:2003] Information technology -- Lossless and near-lossless compression of
continuous-tone still images: Extensions, April 2, 2003
[ISO/IEC 14496-3:2009] - Information technology -- Coding of audio-visual objects -- Part 3:
Audio http://webstore.iec.ch/preview/info_isoiec14496-3%7Bed4.0%7Den.pdf
[ISO/IEC 14496-12:2012]. Information technology -- Coding of audio-visual objects -- Part 12:
ISO base media file format (technically identical to ISO/IEC 15444-12:2012).
http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html
[ISO/IEC 14496-14:2003] Information technology -- Coding of audio-visual objects -- Part 14:
MP4 file format.
[ISO/IEC 15444-1:2004] Information technology -- JPEG 2000 image coding system -- Part 1:
Core Coding System. Annex I JP2 File Format syntax Sept 15 2004.
www.jpeg.org/jpeg2000/CDs15444.html
[ISO/IEC 15444-2:2000] Information technology -- JPEG 2000 image coding system -- Part 2:
Extensions, Dec 7 2000, Final Committee Draft (FCD) version available as
www.jpeg.org/public/fcd15444-2.pdf
[ISO/IEC 15444-12:2012] Information technology -- JPEG 2000 image coding system - ISO
Base Media Format. http://standards.iso.org/ittf/PubliclyAvailableStandards/index.html
[ISO/IEC 15948:2003] Information technology — Computer graphics and image processing —
Portable Network Graphics (PNG): Functional specification. W3C Recommendation 10
November 2003 www.w3.org/TR/2003/REC-PNG-20031110/
[ITU 1996] T-REC-T.84. Information technology -- Digital compression and coding of
continuous-tone still images: Extensions, 1996. www.jpeg.org/public/spiff.pdf
48
[Jasc 1998] Jasc Software . Paint Shop Pro (PSP) File Format Specification, Jasc Software,
November, 1998 (Paint Shop Pro (PSP) File Format Version 3.0, Paint Shop Pro Application
Version 5.0) www.jasc.com/support/kb/articles/pspspec.asp
[JEIDA 1998] Japan Electronic Industry Development Association. Digital Still Camera Image
File Format Standard (Exchangeable image file format for Digital Still Camera: Exif), Version
2.1, JEIDA-49-1998, December 1998 www.exif.org/dcf-exif.PDF
[JEITA 2002] Japan Electronic Information Technology Industries Association. Exchangeable
image file format for Digital Still Camera: Exif Version 2.2. JEITA CP-3451. April 2002.
www.exif.org/Exif2-2.PDF
www.kodak.com/global/plugins/acrobat/en/service/digCam/exifStandard2.pdf
[Kirvan 1994] S. Kirvan. Imagine 3.0 Object Format, Impulse Inc., May 1994.
http://www.textfiles.com/programming/FORMATS/t3d_doc1.txt
[Knuth 1968] D. E. Knuth. Semantics of Context-Free Grammars. Mathematical Systems Theory,
vol. 2, pp. 127-145, 1968.
[Knuth 1971] D. E. Knuth. Semantics of Context-Free Grammars (corrections). Mathematical
Systems Theory, vol. 5, pp. 95_96, 1971.
[Koster 1971] C.H.A. Koster. Affix grammars. In ALGOL 68 Implementation, J.E.L. Peck (ed.)
North-Holland Publ. Co., Amsterdam, 1971, pp. 95-109.
[Kuznetkov 2001] Introduction to Advanced Streaming Format version 1.0, January, 2001.
http://avifile.sourceforge.net/asf-1.0.htm
[Larmouth 1999] J. Larmouth. ASN.1 Complete. Morgan Kaufmann.
1999.www.oss.com/asn1/resources/books-whitepapers-pubs/larmouth-asn1-book.pdf
[Lemone 1993] Karen Lemone, Design of Compilers: techniques of Programming Language
Translation, CRC Press, 1993. http://web.cs.wpi.edu/~kal/courses/compilers/project/index.html
http://web.cs.wpi.edu/~kal/courses/compilers/module4/mysa.html
[LightWave 1996] LightWave 3D Object File Format Oct 16, 1996
http://sandbox.de/osg/lightwave.htm
[Lightwave 2001a] Lightwave. Lightwave 3D Object Files (LWo2], November 9, 2001
http://muskrat.middlebury.edu/lt/cr/ongoing/sciviz/LightWave/SDK/docs/filefmts/lwsc.html
[Losch 1997] Philip Losch. CINEMA 4D® — File Format Description, MAXON Computer,
1997. http://paulbourke.net/dataformats/cinema4d/cinema4d.pdf
49
[Luxology 2006] Luxology. “Appendix H: LXO file format Extensions” modo 202 Scripting and
Commands. 2006 www.kxcad.net/modo/Scripting_and_Commands.pdf
[Maishein] Max Maischein. PIC(PRO) Format.
http://147.91.177.212/extra/fileformat/graphics/pic/picpro.txt
[Martin 2007] Michael Martin. SVAF format description, Version 0.2, January 2007.
http://svaf.sourceforge.net/svaf.html
[Microsoft 1992] Microsoft Multimedia Standards Update, New Multimedia Data Types and
Data Techniques July 29, 1992 Revision: 1.0.97
http://netghost.narod.ru/gff/vendspec/micriff/ms_riff.txt
[Microsoft 1994] Microsoft Corporation. Microsoft Multimedia Standards Update: New
Multimedia Data Types and Data Techniques., Revision 3.0, April 15, 1994.
http://download.microsoft.com/download/9/8/6/9863C72A-A3AA-4DDB-B1BA-
CA8D17EFD2D4/RIFFNEW.pdf
[Microsoft & RealNetwork 1998] Microsoft Corporation and RealNetworks, Inc. Advanced
Streaming Format (ASF) Specification. Public Specification Version 1.0, February 26, 1998.
http://avifile.sourceforge.net/asf0298rtf.htm http://avifile.sourceforge.net/asf0298rtf.zip
[Microsoft 2003b] Microsoft Corporation. Multiple Channel Audio Data and WAVE Files
http://www-mmsp.ece.mcgill.ca/Documents/AudioFormats/WAVE/Docs/multichaudP.pdf
[Microsoft 2004] Microsoft. Advanced Systems Format (ASF) Specification, Revision 01.20.03,
December 2004. www.microsoft.com/windows/windowsmedia/forpros/format/asfspec.aspx
www.microsoft.com/en-us/download/details.aspx?displaylang=en&id=14995
[Microsoft 2005] Microsoft, AVI RIFF File Reference, 2005 http://msdn.microsoft.com/en-
us/library/ms899421.aspx
[Morrison 1985] Jerry Morrison. “EA IFF 85” Standard for Interchange Format Files,
Electronic Arts, January 14, 1985 www.martinreddy.net/gfx/2d/IFF.txt
[Morrison 1986a] Jerry Morrison. "ILBM" IFF Interleaved Bitmap , Electronic Arts, January 17,
1986 www.fine-view.com/jp/labs/doc/ilbm.txt
[Morrison 1986b] J. Morrison. “SMUS" IFF Simple Musical Score”, Electronic Arts, Inc.,
February 5, 1986. http://amigan.1emu.net/reg/SMUS.txt
[MultimediaWiki 2006] Smush Animation Format, August 2006
http://wiki.multimedia.cx/index.php?title=SAN#Chunk_Forma
50
[MultimeduiaWiki 2008a] Electronic Arts CMV, April 2008
http://wiki.multimedia.cx/index.php?title=Electronic_Arts_CMV
[MultimediaWiki 2008b] Electronic Arts DCT, April 2008
http://wiki.multimedia.cx/index.php?title=Electronic_Arts_DCT
[MultimediaWiki 2008c] Electronic Arts MAD, April 2008
http://wiki.multimedia.cx/index.php?title=Electronic_Arts_MAD
[MultimediaWiki 2008d] Electronic Arts MPC, April 2008
http://wiki.multimedia.cx/index.php?title=Electronic_Arts_MPC
[MultimediaWiki 2008e] Electronic Arts VP6, April 2008
http://wiki.multimedia.cx/index.php?title=Electronic_Arts_VP6
[MultimediaWiki 2008f] Electronic Arts TGV, August 2008
http://wiki.multimedia.cx/index.php?title=Electronic_Arts_TGV
[MultimediaWiki 2008g] Electronic Arts TGQ, November 2008
http://wiki.multimedia.cx/index.php?title=Electronic_Arts_TGQ
[MultimediaWiki 2008h] American Laser Games MM, January 2008
http://wiki.multimedia.cx/index.php?title=American_Laser_Games_MM
[MultimediaWiki 2009a] Electronic Arts TQI, February 2009
http://wiki.multimedia.cx/index.php?title=Electronic_Arts_TQI
[MultimediaWiki 2009b] SANM, October 2009.
http://wiki.multimedia.cx/index.php?title=SANM
[Novikov and Fillipov] Igor E. Novikov and Valek Fillipov. Cdrloader.py, A cdr import filter
written in Python in the open source graphics file convertor, UniConvertor.
http://sk1project.org/modules.php?name=Products&product=cdrexplorer
[OGF 2011] OGF DFDL WG. Data Format Description Language (DFDL) v1.0 Specification,
January 31, 2011. www.ogf.org/documents/GFD.174.pdf
[Oppenheim 1988] Dave Oppenheim. Standard MIDI Files 0.06, March 1, 1988
www.ibiblio.org/emusic-l/info-docs-FAQs/MIDI-doc/MIDI-SMF.txt
[Parr & Quong 1995] T. J. Parr & R. W. Quong. ANTLR: A predicated- ll(k) parser generator.
Software: Practice and Experience. 25, 7 (July 1995), 789-810.
http://www.antlr.org/article/1055550346383/antlr.pdf
51
[Parr 2007] T. Parr. The Definitive ANTLR Reference: Building Domain-Specific Languages.
Pragmatic, 2007
[Parr 2012] T. Parr. The Definitive ANTLR4 Reference. Pragmatic, 2012.
[Peters] Christoph Peters. The Ultimate 3D model file format (v. 2.0)
http://ultimate3d.org/U3DModelFileFormatV2.htm
[Pfeiffer 2003] S. Pfeiffer. The Ogg Encapsulation Format Version 0, RFC 3533, May 2003.
http://xiph.org/ogg/doc/rfc3533.txt
[Plotkin] Andrew Plotkin . Blorb: An IF Resource Collection Format Standard, Version 2.0
www.eblong.com/zarf/blorb/
www.ifarchive.org/if-archive/infocom/interpreters/specification/blorb_format.txt
[Randelshofer 2011] Werner Randelshofer. Class RIFFParser, January 6, 2011
www.randelshofer.ch/multishow/javadoc/ch/randelshofer/media/riff/RIFFParser.html
[Randers-Pehrson 2001] Glenn Randers-Pehrson (ed.) MNG (Multiple-image Network Graphics)
Format Version 1.0, 2001 www.libpng.org/pub/mng/spec/#Credits
[Rentz 2008] Daniel Rentz. OpenOffice.org's Documentation of the Microsoft Excel File
Format, Excel Versions 2, 3, 4, 5, 95, 97, 2000, XP, 2003, OpenOffice.org, April 2008.
http://sc.openoffice.org/excelfileformat.pdf
[REWiki 2009]reWiki. .IFF. 2009 http://rewiki.regengedanken.de/wiki/.IFF
[RFC 3072] M. Wildgrube. Structured Data Exchange Format (SDXF) Request for Comments:
3072, March 2001 www.ietf.org/rfc/rfc3072.txt
[Shaw et al 1985] Steve Shaw, Jerry Morrison and Bob Burns, FTXT: IFF Formatted Text, Draft
2.6, Electronic Arts. November 15, 1985 http://fortunaty.net/com/textfiles/computers/ftxt
[Soras 1995] L. de Soras. Official formats of the Graoumf Tracker modules .GT2 v1 - v3 and
old GT modules GTK, 31/08/95. http://www.wotsit.org/list.asp?search=GTK&button=GO%21
[Sparta 1988] Sparta, Inc., ANIM: An IFF Format For CEL Animations, 4 May 1988
http://amigan.1emu.net/reg/ANIM.txt
[Tamir 2005] Giora Tamir. C# RIFF Parser, June 2005
www.codeproject.com/KB/files/riffparser.aspx
52
[Thirunarayan 2008] Krishnaprasad Thirunarayan. Attribute Grammars and their Applications.
In: Encyclopedia of Information Science and Technology, Second Edition, Editor: Mehdi
Khosrow-Pour, Information Science Publishing, pp. 268-273, 2008.
www.cs.wright.edu/~tkprasad/papers/Attribute-Grammars.pdf
[Underwood 2009] W. Underwood. Extensions of the UNIX file Command and Magic File for
File Type Identification. PERPOS TR ITTL/CSITD 09-02 Georgia Tech Research Institute,
September 2009.
[Ung 1996] David Ung. Retargetable loader, Honours thesis, Computer Science and Electrical
Engineering, The University of Queensland, 1996
http://itee.uq.edu.au/~cristina/students/david/david.html
[Walkin 2006] L. Walkin. Using the Open Source ANS.1 Compiler. 2006
http://lionet.info/asn1c/asn1c-usage.pdf
[van Wijngaarden 1974] A. van Wijngaarden. The generative power of two-level grammars.
Automata, Languages and Programming. Lecture Notes in Computer Science #14, J. Loeckx (e.)
Springer-Verlag, Berlin, 1974, pp. 9-16.
[Wikipedia 2012] Wikipedia. DivX http://en.wikipedia.org/wiki/DivX
[Wilhelm and Mauer 1995] R. Wilhelm and D. Maurer. Compiler Design. Addison-Wesley,
1995. (chapter 10) (Attribute Grammars)
[Xiph 2010] Xiph.Org Foundation. Vorbis I specification, February 3, 2010
http://xiph.org/vorbis/doc/Vorbis_I_spec.html
[Zappe 1994] H. Zappe. Oktalyzer, 1994.
http://www.wotsit.org/list.asp?search=okt&button=GO%21
53
Appendix A: Binary Array Grammars for Chunk-based Binary File
Formats
This appendix sketches binary array grammars for some of binary chunk-based file formats.
A.1 Electronic Arts Interchange File Format (IFF)
Electronic Arts and Commodore-Amiga introduced the first chunk-based file format in the
specification for the Interchange File Format in 1985[Morrison 1985, Amiga 2011].
A.1.1 Interleaved Bitmap (ILBM)
A binary array attribute grammar for this file format was presented in section 4.1 of this report.
[Morrison 1986]
A.1.2 8-bit Sampled Voice (8SVX)
The file format specification for the 9-bit sampled voice file format uses C programming
language data structures to define the format [Hayes and Morrison 1985]. The following is a re-
expression of that specification as a context-free binary array grammar. It does not yet include
synthesized and inherited attributes.
<8SVX> → “FORM” <cksize INT32> “8SVX” <ckdata>
<ckdata> → <voice8header> <bodyck>
<ckdata> → <voice8header> <nameck> <copyrightck>< bodyck>
<ckdata> → <voice8header> <copyrightck> <nameck>< bodyck>
<ckdata> → <voice8header><nameck>< bodyck>
<ckdata> → <voice8header> <copyrightck>< bodyck>
<ckdata> → <voice8header> <authorck>< bodyck>
<ckdata> →< voice8header> <annotation>< bodyck>
<ckdata> → <voice8header> <atakck>< releaseck>< bodyck>
<nameck> → “NAME” <cksize INT32 kk> CHAR[kk]
<copyrightck> → ‘(c) ‘ <cksize INT32 ll> CHAR[ll]
<authorck> → “AUTH”<cksize INT32 mm>CHAR[mm]
<annotation> → <annotationck> <annotation>
<annotation> → <nnotationck>
<annotationck> → “ANNO” <cksize INT32 i> CHAR[i]
<atakck>→ “ATAK” <cksize INT32 nn><EGpoint>[nn]
<releaseck> → “RLSE” <cksize INT32 m> <EGpoint>[m]
<EGpoint> → <duration UWORD>< dest FIXED>
<bodyck> → “BODY” <cksize INT32 n> samples[n]
54
A.2 Standard Midi File Format
Oppenheim [1988] described the midi file format using the pseudo EBNF notation shown below
[Int. MIDI Assoc 2008].
<Midi> → <headerchunk> <trackchunks>
<headerchunk> → “MThd” <length of header data> <header data>
<header data> → <format> <ntrks> <division>
<trackchunks> → <trackchunk> <trackchunks>
<trackchunk> → “MTrk” <length of track data> <track data>
<track data> → <MTrk event>+
<MTrk event> → <delta-time> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length> <bytes to be transmitted after F0>
<sysex event> → 0xF7 <length> <all bytes to be transmitted>
<meta-event> → 0xFF <type> <length> <bytes>
The “length of header data” is a 32-bit big-endian integer with value 6. Header data consists of
three 16-bit integers. The pseudo EBNF can be more precisely specified with the following
context -free binary array grammar. This grammar does not yet include synthesized and inherited
attributes
<Midi> → <headerchunk> <trackchunks>
<headerchunk> → “MThd” <headerdatalength BEINT32 6> <headerdata>
<headerdata> → <format BEINT16> <ntrks BEINT16> <division BEINT16>
<format BEINT16> → 1 | 2| 3
<trackchunks> → <trackchunk> <trackchunks>
<trackchunk> → “MTrk” <trackdatalength BEINT32> <trackdata>
<trackdata> → <MTrkevent>+
<MTrkevent> → <delta-time VARLENQUAN> <event>
<event> → <MIDI event> | <sysex event> | <meta-event>
<MIDI event> → <MIDI channel message>
<sysex event> → 0xF0 <length VARLENQUAN k> <bytes to be transmitted
after F0>[k]
<sysex event> → 0xF7 <length VARLENQUAN m> <all bytes to be
transmitted>[m]
<meta-event> → 0xFF <type BYTE> <length VARENQUAN n> <bytes>[n]
The datatype VARLENQUAN is a Variable –Length Quantity. “These numbers are represented
7 bits per byte, most significant bits first. All bytes except the last have bit 7 set, and the last byte
has bit 7 clear.”
A.3 Audio Interchange File Format
The Audio Interchange File Format (AIFF) developed by Apple Computer in 1988 was based on
Electronic Arts’ Interchange File Format. The file format is specified using C data structure
55
definitions [Apple 1989]. The following grammar recasts the specification as a context-free
pseudo EBNF grammar. It does not yet include the data types per a binary array grammar or
synthesized and inherited attributes.
<AIFF> -> “FORM”<cksize INT32>“AIFF”<localchunks>
<localhunks> -> <CommonChunk><SoundDataChunk>
<CommonChunk> -> COMM”<cksize INT32 18><numChannels INT16>
<numSampleFrame UINT32><sampleSize INT16>
<sampleRate FLOAT80>
<SoundDataChunk> -> “SSND”<cksize INT32 k><offset UINT32>
<blockSize UINT32><soundData UBYTE>[k-8]
<MarkerChunk> -> “MARK”<cksize INT32 mm><Markers>[mm]
<Markers> -> <Marker><Markers>
<Marker> -> <markerid UINT16><position UINT32><markerName PSTRING>
<InstrumentChunk> -> “INST”<cksize INT32>
<baseNote CHAR>
<detune CHAR>
<lowNote CHAR>
<highNote CHAR>
<lowVelocity CHAR>
<highVelocity CHAR>
<gain INT16>
<sustainLoop> <Loop>
<releaseLoop> <Loop>
<Loop> -> <playmode UINT16><beginloop markerid><endloop markerid>
<playmode UINT16> -> 0 | 1 | 2
<MIDIDataChunk> -> “MIDI”<cksize INT32 nn><MIDIData>[nn]
<AudioRecordingChunk> -> “AESD”<cksize INT32><AESChannelStatusData>[24]
<ApplicationSpecificChunk> -> “APPL”<cksize INT32 kk>
<applicationSignature CHAR{4]>
<data pstring>[kk-4]
<CommentsChunk> -> “COMT”<cksize INT32 i>
<numcomment UINT16><comments>[i-4]
<Comment> -><timestamp UINT32>
<marker markerid>
<count UINT16 j>
<text CHAR>[i]
56
A.4 Resource Interchange File Formats
The Resource Interchange File Format (RIFF) introduced by IBM and Microsoft[1991] is also
based on Electronic Arts’ Interchange File Format. The Microsoft container formats like Audio-
Video Interleave (AVI) and Waveform PCM (WAV) use RIFF as their basis [Microsoft 1992]
[Randelshofer 2011]. WebP, a picture format introduced in (2010) by Google, also uses RIFF as
a container.
A.4.1 RIFF-Waveform (WAVE)
This file format was discussed in section 6.3 of this report [IBM & Microsoft 1991].
A.4.2 WAVEFORMATEX
The WAVEFORMATEX type must contain both a fact chunk and an extended wave format
description within the 'fmt' chunk [Microsoft 1994]. The fact chunk specifies the length of the
file in samples. Wave_Format-PCM has value 0x0001 for wFormat Tag. WAVEFORMATEX
has different values for wFormat Tag. It also adds cbsize, the count in bytes of the extra size to
the common fields.
<common-fields> -> <wFormatTag UINT16>
<wChannels UINT16>
<nSamplesPerSec UINT32>
<nAvgBytesPerSec UINT32>
<nBlockAlign UINT16>
<wBitsPerSample UINT16>
<cbsize UINT16 m>
<extradata>[m]
A.5 JPEG
The Joint Photographic Experts Group (JPEG) group issued the first JPEG standard in 1992 as
ITU-T Recommendation T.81 [CCITT 1992] and in 1994 as [ISO/IEC 10918-1:1994]
The JPEG standard specifies the codec, which defines how an image is compressed into a stream
of bytes and decompressed back into an image, but not the file format used to contain that
stream. The Exif and JFIF standards define the commonly used file formats for interchange of
JPEG-compressed images.
A.5.1 JPEG File Interchange Format (JFIF)
A JPEG image consists of a sequence of segments, each beginning with a marker, each of which
begins with a 0xFF byte followed by a byte indicating what kind of marker it is. Some markers
consist of just those two bytes; others are followed by two bytes indicating the length of marker-
57
data that follows. The length includes the two bytes for the length, but not the two bytes for the
marker. [Hamilton 1992, Cuturicu and Fromme 1999, Ecma 2009]
APP0 is the Application Segment for JFIF.
<JFIF> → <SOI><APP0>< segments> <EOI>
<APP0> → 0xFFE0 <segsize UINT16> “JFIF” <NUL><majversion ubyte>><minversion
ubyte><densityunit ubyte><Xdensity UINT16><Ydensity UINT16><thumbnailwidth ubyte
tw><thumbnailheight ubyte th><thumbnail_data 24-bits> [tw x th]
<segments> → <segment> <segments>
<segment> → <marker> <segsize UINT16 n> <data>[n-2]
<marker> → 0xFF <type BYTE>
<SOI> → 0xFFD8
<EOI> → 0xFFD9
<NUL> → 0x00
A.5.2 JPEG/Exif
APP1 is the Application Segment for EXIF for JPEG files. [JEIDA 1998, JEITA 2002] Attribute
Information for JPEG compressed files is stored in the Tag information format of TIFF 6.0.
<Exif> → <SOI><APP1>[APP2>]<DQT><DHT><SOF><SOS> <Compressed_Data><EOI>
<APP1> → 0xFFE1 <segsize UINT16> “Exif” <NUL><NUL><TIFF_Header>
<TIFF_Header> →”II”|”MM”)(0x2A<Exif_IFD-Pointer BEINT32 8><Exif_IFD>
<Exif_IFD> → <ExifVersion> | <ImageWidth> | <Image_Length> | <BitsPerSample> |
<Compression>
<ExifVersion> → <Tag UINT16 0x9000><Type UINT16 0x07><Count UINT32 4>”0201”
<APP2> → 0xFFE2<segsize UINT16>”FPXR”<NUL><Version BYTE 0x00><Contents>
<DQT> → 0xFFDB<segsize UINT16 197>
<Y byte 0x00><QTY BYTE>[64]
<Cb BYTE 0x01><QTCb BYTE[64]
<Cr BYTEv 0x02><QTCr BYTE>[64]
<DHT> → 0xFFC4<segsize UINT16 0x01A2>data[418]
<SOF> → 0xFFC0<segsize UINT16 17>
<Precision UBYTE x08>
<Vert_Lines><Horiz_Lines> <Components>
<Component_No><H0_V0><Quantization_Designation>
<Component_No><H1_V1><Quantization_Designation>
<Component_No><H2_V2><Quantization_Designation>
58
<SOS> → 0xFFDA<segsize UINT16 12><Components_in_Scan BYTE 0x03>
<Component_Selector_Y><Huffman_Table_Selector_Y>
<Component_Selector_Cb><Huffman_Table_Selector_C>
<Component_Selector_Cr><Huffman_Table_Selector_C>
<Scan_Start_position><Scan_end_position>
<Successive_approximation_Bit_position>
<SOI> → 0xFFD8
<EOI> → 0xFFD9
<NUL> → 0x00
A.5.3 SPIFF (Still Picture Interchange File Format)
SPIFF is a JPEG file format that is intended to replace the defacto JFIF (JPEG File Interchange
Format). SPIFF includes all of the features of JFIF and adds more functionality. The SPIFF file
format is specified in annex F of ITU-T Recommendation T.84. [ITU 1998]
<SPIFF> → <SOI><APP8>< DirectoryEntries><Image_Data> <EOI>
<APP8> → 0xFFE8 <segsize UINT16 32> “SPIFF” <NUL>
<VERS INT16 0x0100>
<P INT8>
<NC INT8>
<Height INT32>
<Width INT32>
<S INT8>
<BPS INT8>
<C INT8>
<R INT8>
<VRES INT32>
<HRES INT32>
<DirectoryEntries> → <Entry> <DirectoryEntries>
<DirectoryEntries> → <EOD>
<EOD> 0xFFE8<EODLength UINT16 8><EODTag INT32 1>
<Entry> → 0xFFE8<EntryLength UINT16 n><ETag INT32> <data>[n-6]
<SOI> → 0xFFD8
<EOI> → 0xFFD9
<NUL> → 0x00
A.6 Advanced Systems Format
Microsoft’s Advanced Systems Format (ASF) version 1 [Kuznetkov 2001] is also a chunk-based
container format. It was developed about 1995-96, but its specification was never published.
59
Microsoft’s Advanced System Format version 2 (Public Version 1) was published in 1998
[Microsoft & RealNetwork 1998] with a revision in 2004 [Microsoft 2004].
The format does not specify with which codec the video or audio should be encoded. It just
specifies the structure of the video/audio stream. The most common file formats (codecs?)
contained within an ASF file are Windows Media Audio (WMA) and Windows Media Video
(WMV).
<ASF> → <HeaderObject><DataObject>
<HeaderObject>→<Header_Object_ID GUID 75B22630-668E-11CF-A6D9-00AA0062CE6C>
<Size LEINT64>
<NumHeaderObjects LEINT32>
<Reserved1 BYTE 0x01>
<Reserved2 BYTE 0x02>
<FilePropertiesObject>
<HeaderExtensionObject>
<StreamPropertiesObject>
<FilePropertiesObject> → <GUID 8CABDCA1-A947-11CF-8EE4-00C00C205365>
<Size LENT64>
<FileID GUID>
<FileSize LEINT64>
<CreationDate LEINT64>
<DataPacketsCount LEINT64 m>
<PlayDuration LEINT64>
<SendDuration LEINT64>
<Preroll LEINT64>
<Flags Bytes[4]>
<MinDataPacketSize LEInt32>
<MaxDataPacketSize LEINT32>
<MaxBitRate LEINT32>
<StreamPropertiesObject> → <ID GUID B7DC0791-A9B7-11CF-8EE6-00C00C205365>
<Object_Size LEUINT64>
<Stream_Type GUID>
<Error_Correction _Type GUID >
<Time_Offset LEUINT64>
<Type-Specific_Data_Length LEUint32 m>
<Error_Correction_Data_Length LEUINT32 n>
<Flags BYTE[2]>
60
<Reserved LEUIN32 0>
<Type-Specific_Data BYTE>[m]
<Error_Correction _Data BYTE>[n]
<HeaderExtensionObject> → <ID GUID 5FBF03B5-A92E-11CF-8EE3-00C00C205365>
<Object_Size LEUINT64>
<Reserved_Field1 GUID ABD3D211-A9BA-11cf-8EE6-00C00C205365>
<Reserved_Field2 LEUINT16 6>
<Header_Extension_Data_Size LEUINT32 n>
<Header_Extension_Data BYTE>[n]
<DataObject> → <Object_ID GUID 75B22636-668E-11CF-A6D9-00AA0062CE6C >
<Object_Size LEUINT64>
<File_ID GUID >
<Total_Data_Packets LEUINT64 m>
<Reserved WORD 0x0101>
<DataPacket>[m]
A.6.1 Windows Media Audio 7-9
When the Stream Type of the Stream Properties Object has the value ASF_Audio_Media =
F8699E40-5B4D-11CF-A8FD-00805F5C442B, the ASF audio media type that populates the
Type-Specific Data field of the Stream Properties Object is represented using the following
structure.
<Type-Specific_Data BYTE>[m] → <CodecID LEWORD>
<nChannels WORD>
<nSamplesPerSec LEINT32>
<nAvgBytesPerSec LEINT32>
<nBlockAlign WORD>
<wBitsperSample LEINT16>
<CbSize LEINT16 n>
<Codec Specific Data BYTE>[n]
The <CodecID> specifies the unique ID (FormatTag) of the codec used to encode the audio data.
The following values indicate codecs for Windows Media Audio 7-9.
Codec name Codec ID/ Notes
61
Format tag
Windows Media
Audio
0x161 Versions 7, 8, and 9 Series
Windows Media
Audio 9 Professional
0x162 9 Series
Windows Media
Audio 9 Lossless
0x163 9 Series
A.6.2 Windows Media Video 7-9.1
When the Stream Type of the Stream Properties Object has the value ASF_Video_Media =
BC19EFC0-5B4D-11CF-A8FD-00805F5C442B, the ASF video media type that populates the
Type-Specific Data field of the Stream Properties Object is represented using the following
structure. [Microsoft 2004]
<Type-SpecificData> → <EncodedImageWidth LEINT32>
<EncodedImageHeight LEINT32>
<ReservedFlsgs INT16 2>
<FormatDataSize LEUINT32 n
<FormatData>
<FormatData> → <FormatDataSize LEINT32 m>
<ImageWidth LEINT32>
<ImageHeight LEINT32>
<Reserved LEINT16 1>
<nBitsPerPixel LEINT16>
<CompressionID CHAR[4]>
<ImageSize LEINT32>
<HorizontalPixelsPerMeter LEINT32>
<VerticalPixelsPerMeter LEINT32>
<nColorsUsed LEINT32>
<nImportantColors LEINT32>
<CodecSpecificData BYTE>
The following values of CompressionID indicate codecs for Windows Media Video 7-9.
CompressionID Compression Name
WMV1 Windows Media Video 7
WMV2 Windows Media Video 8
62
WMV3 Windows Media Video 9
A.7 Portable Network Graphics
[ISO/IEC 15948:2003]
<PNG> -> <PNGSignature><headerck><paletteck><imagedata><trailer>
<PNGSignature> → 0x89”PNG”0xODOA1A0A
<headerck> -> <cksize UINT32>“IHDR”
<Width UINT32>
<Height UINT32>
<BitDepth UINT8>
<ColourType UINT8>
<Compressionmethod UINT8>
<Filtermethod UINT8>
<Interlacemethod UINT8>
<CRC DWORD>
<paletteck> -> <cksize UINT32>“PLTE”<
<entries><CRC DWORD>
<entries> -> <entry><entries>
<entry> -> <red BYTE>
<green BYTE>
<blue BYTE>
<imagedata> → <cksize UINT32 n>”IDAT”<data BYTE>[n]
<trailer> → cksize UINT32>”IEND”<CRC DWORD>
A.8 Apple QuickTime Movie
[Apple 2007]
<QTFF> → <file type atom><movie atom>
<file type atom> → <atomsize UINT32><type string[4]=”ftyp”>
<major_brand string[4]=”qt “> <minor_version 0x20050300>
<compatible_brands string[4]>[atomsize-16]
<movie atom> → <atomsize UINT32><type string[4]=”moov”>
<movie header atom>
<track atom>
<movie header atom> → <atom size UINT32><type string[4]=”mvhd”>
<version BYTE>
<flags BYTE[3]>
<creation time UINT32>
63
<modification time UINT32>
<time scale UINT32>
<Duration UINT32>
<preferred rate UINT32>
<preferred volume UINT16>
<reserved BYTE[10]=0x00>
<matrix structure UINT32[9]>
<preview time UINT32 >
<preview duration UINT32 >
<poster time UINT32>
<selection time UINT32>
<selection duration UINT32>
<current time UINT32>
<next track id UINT32>
<track atom> → <atom size UINT32><type string[4]=”trak”>
<track header atom>
<media atom>
<track header atom> → <atom size UINT32> <type string[4]=”tkhd”>
<version BYTE>
<flags BYTE[3]>
<creation time UINT32>
<modification time UINT32>
<track ID UINT32>
<reserved UINT32=0>
<duration UINT32>
<reserved BYTE[8]=0x00>
<layer UINT16>
<alternate group UIN16>
<volume UINT16>
<reserved UINT16=0>
<matrix structure INT32[9]>
<track width UINT32>
<track height INT32>
<media atom> → <atom size UINT32><type string[4]=”mdia”
<media header atom>
<media header atom> → <atom size UINT32><type string[4]=”mdhd”
64
<version BYTE>
<flags BYTE[3]>
<creation time UINT32>
<modification time UINT32>
<time scale UINT32>
<duration UI32>
<language UINT16>
<quality UINT16>
A.9 ESRI Shapefile
[ESRI 1998]
<shapefile>→ <filecode BEUNT32=9994>
<unused INT32>[5]
<filelength BEINT32>
<version LEINT32=1000>
<shapetype LEINT32>
<Xmin LEFP64>
<Ymin LEFP64>
<Xmax LEFP64>
<Ymax LEFP64>
<Zmin LEFP64>
<Zmax LEFP64>
<Mmin LEFP64>
<MmaxLEFP64>
<records>
<records> → <record><records> | <record>
<record> → <recordnumber BEINT32>
<contentlength BEINT32>
<recordcontent>
<recordcontent> → <NullShapeRecord> |
<PointRecord> |
<MultiPointRecord> |
<PolyLineRecord> |
<PolygonRecord> |
<PointMRecord> |
<MultiPointMRecord> |
<PolyLineMRecord> |
<PolygonMRecord> |
<PointZRecord> |
65
<MultiPointZRecord> |
<PolyLineZRecord> |
<PolygonZRecord> |
<MultiPatchRecord>
<NullShapeRecord> → <shapetype LEINT32=0>
<PointRecord> → <shapetype LEINT32=1><Point>
<MultiPointRecord> → <shapetype LEINT32=8><MultiPoint>
<PolyLineRecord> → <shapetype LEINT32=3><PolyLine>
<PolygonRecord> → <shapetype LEINT32=5><Polygon>
<PointMRecord> → <shapetype LEINT32=21><PointM>
<MultiPointMRecord> → <shapetype LEINT32=28><MultiPointM>
<PolyLineMRecord> → <shapetype LEINT32=23><PolyLineM>
<PolygonMRecord> → <shapetype LEINT32=25><PolygonM>
<PointZRecord> → <shapetype LEINT32=11><PointZ>
<MultiPointZRecord> → <shapetype LEINT32=18><MultiPointZ>
<PolyLineZRecord> → <shapetype LEINT32=13><PolyLineZ>
<PolygonZRecord> → <shapetype LEINT32=15><PolygonZ>
<MultiPatchRecord> →<shapetype LEINT32=31><MultiPatch>
<Point> → <X LEFP64> <Y LEFP64>
<MultiPoint> → <Box>
<NumPoints LEINT32>
<Points>[NumPoints]
<Box> → <Xmin LEFP64><Ymin LEFP64> <Xmax LEFP64><Ymax LEFP64>
<PolyLine> → <Box>
<NumParts LEINT32>
<NumPoints LEINT32>
<Parts LEINT32>[NumParts]
<Points>[NumPoints]
<Polygon> → <Box>
<NumParts LEINT32>
<NumPoints LEINT32>
<Parts LEINT32>[NumParts]
<Points>[NumPoints>
<PointM> → <X LEFP64><Y LEFR64><Measure LEFP64>
<MultiPointM> → <Box>
<NumPoints LEINT32>
<Point>[NumPoints]
<Mmin LEFP64>
<Mmax LEFP64>
<Marray LEFP64>[BNumPoints]
<PolyLineM> → <Box>
<NumParts LEINT32>
66
<NumPoints LEINT32>
<Parts LEINT32>[NumParts]
<Point>[NumPoints]
<Mmin LEFP64>
<Mmax LEFP64>
<Marray LEFP64>[NumPoints]
<PolygonM> → <Box>
<NumParts LEINT32>
<NumPoints LEINT32>
<Parts LEINT32>[NumParts]
<Point>[NumPoints]
<Mmin LEFP64>
<Mmax LEFP64>
<Marray LEFP64>[NumPoints]
<PointZ> → <X LEFP64>
<Y LEFP64>
<Z LEFP64>
<M LEFP64>
<MultiPointZ> → <Box>
<NumPoints>
<Point>[NumPoints]
<Zmin LEFP64>
<Zmax LEFP64>
<Zarray LEFP64>[NumPoints]
<Mmin LEFP64>
<Mmax LEFP^$>
<Marray LEFP64>[NumPoints]
<PolyLineZ> → <Box>
<NumParts LEINT32>
<NumPoints LEINT32>
<Parts LEINT32>[NumParts]
<Point>[NumPoints]
<Zmin LEFP64>
<Zmax LEFP64>
<Zarray LEFP64>[NumPoints]
<Mmin LEFP64>
<Mmax LEFP64>
<Marray LEFP64>[NumPoints
<PolygonZ> → <Box>
<NumParts LEINT32>
<NumPoints LEINT32>
<Parts LEINT32>[NumParts]
<Point>[NumPoints]
67
<Zmin LEFP64>
<Zmax LEFP64>
<Zarray LEFP64>[NumPoints]
<Mmin LEFP64>
<Mmax LEFP64>
<Marray LEFP64>[NumPoints
<MultiPatch> → <Box>
<NumParts LEINT32>
<NumPoints LEINT32>
<Parts LEINT32>[NumParts]
<PartTypes LEINT32>[NumParts]
<Point>[NumPoints]
<Zmin LEFP64>
<Zmax LEFP64>
<Zarray LEFP64>[NumPoints]
<Mmin LEFP64>
<Mmax LEFP64>
<Marray LEFP64>[NumPoints
68
Appendix B: Other Binary Chunk-based File Formats
File Format Name Reference Comments
CEL Animation (ANIM) Sparta 1988 Electronic Arts IFF
IFF Formatted Text (FTXT) Shaw et al 1985 “
IFF Simple Musical Score
(SMUS)
Morrison 1986b “
Planar Bitmap (PBM) REWiki 2009 “
Audio Interchange File Format -
Compressed
Apple 1991
WAVEFORMATEXTENSIBLE Microsoft 2003b Resource Interchange File
Formats (RIFF)
Broadcast Wave Format (BWF) EBU 2001 “
Windows Audio-Video
Interleave (AVI)
Microsoft 2005 “
Animated Windows Cursor
(ANI)
Houghtaling 1996 “
RIFF MIDIfile (RMID) IBM & Microsoft 1991 “
DIVX Media Format (DMF) Wikipedia 2012 “
Device Independent Bitmap
(RIFF RDIB)
IBM & Microsoft 1991 “
WebP Google 2010 “
JPEG Titled Image Pyramid
(JTIP)
ISO/IEC 10918-3:1997
JPEG-LS ISO/IEC 14495-1:1999
ISO/IEC 14495-2:2003
Multiple-Image Network
Graphics
Randers-Pehrson 2001
JPEGn Network Graphics Randers-Pehrson 2001
Binary Interchange File Format Rentz 2008 Excel File Format,
Versions 2, 3, 4, 5, 95, 97,
2000, XP, 2003
3D Studio File format – 3ds Autodesk 1997
3D Studio Material Library File Fercoq 1996
Anim8or Glanville 2003
Animator Pro Cel Crazy Talk
69
Animator Pro Fli Haaland
Animator FLC Comuphase
Animator Pro PIC Maischein
Audio Manager Module Foo 1995
Autodesk Drawing Interchange
Files (DXF) (Binary)
Autodesk 2009
Caligari Truespace Scene/Object
(Binary)
Caligari 1998
CorelDraw Vector Graphics File
(Versions 3.0-X3)
Novikov and Fillipov
Corel Metafile Exchange Image
(CMX)
Corel 1998
DigiTrekker Module Format Beham 1995 Music Modules
Graoumf Tracker Modules Soras 1995 “
Oktalyzer
Zappa 1994 “
Electronic Arts Music (.asf) Anisimovsky 2002b EA Multimedia Game File
Formats, Anisimovky
2002a
Electronic ARTS CMV Multimedia Wiki 2008a “
Electronic Arts DCT Multimedia Wiki 2008b “
Electronic Arts MAD Multimedia Wiki 2008c “
Electronic Arts MPC Multimedia Wiki 2008d “
Electronic Arts VP6 Multimedia Wiki 2008e “
Electronic Arts TGV Multimedia Wiki 2008f “
Electronic Arts TGQ Multimedia Wiki 2008g “
Electronic Arts TQI Multimedia Wiki 2009a “
Imagine 3D Data Description
(TDDD)
Kirvan 1994
LWOB Lightwave 1996 Lightwave 3D Objects
LWO2 Lightwave 2001a “
LWLO Ferguson 1995 “
Luxology Modo 3D Object
(LXOB)
Luxology 2006
Pant Shop Pro (PSP) Jasc 1998
ProVector 2D Drawing Cunni9ff and Orr
ProWrite Document Bayless 1987
Blorb Z-machine Resource File Plotkin Interactive Fiction
70
Quetzal Frost 1997a, 1997b Interactive Fiction
Apple QuickTime Apple 1996, 2001
Apple itunes Video ISO/IEC 14496-14:2003
Apple itunes AAC Audio ISO/IEC 13818-7:2006
ISO/IEC 14496-3:2009
Apple itunes AAC AES
Production Audio
ISO/IEC 13818-7:2006
ISO/IEC 14496-3:2009
JPEG 2000 ISO/IEC 15444-1:2004
JPEG 2000 Codestream ISO/IEC JTC 1/SC 29/WG 1
Annex A: Codestream Syntax
ISO Base Media Format ISO/IEC 14496-12:2012
ISO/IEC 15444-12:2012
MPEG-4 ISO/IEC 14496-14:2003
Motion JPEG 2000 ISO/IEC 15444-3
ITU-T T.802
3GGP Media ETSI 2012
3GPP2 Media 3GPP2 2003
Maya IFF Bitmap Alias-Wavefront 1999
Apple Core Audio format Apple 2005
American Laser Games (MM) Multimedia Wiki 2008h
Smush Animation Format
(SAN)
Multimedia Wiki 2006
Multimedia Wiki 2009b
Structured Data eXchange
Format (SDXF)
RFC 3072
Ogg Vorbis Pfeiffer 2003, Xiph 2010
Simple Vector Array Format Martin 2007
Nested List File (NLF) Blank Aspect 2010a 2010b
Cinema4D Losch 1997
The Sims Interchange File
Format
Baum & Noel 2002
71
Appendix C: Utilities to Pretty Print a Nested List Style Parse
Tree
private static int depth = 0;
public static String prettyPrintTree(Tree t) throws Exception {
return prettyPrintTree(t, (List<String>) null);
}
public static String prettyPrintTree(Tree t, Parser recog)
throws Exception {
String[] ruleNames = recog != null ? recog.getRuleNames() : null;
List<String> ruleNamesList =
ruleNames != null ? Arrays.asList(ruleNames) : null;
return prettyPrintTree(t, ruleNamesList);
}
public static String prettyPrintTree(Tree t, List<String> ruleNames)
throws Exception {
String ruleName = Trees.getNodeText(t, ruleNames);
StringBuilder buf = new StringBuilder();
depth++;
buf.append("(");
buf.append(ruleName);
buf.append(' ');
if (ruleName.startsWith("dt")) {
// Since t is a Rule Node and it starts with dt,
// it must be the context of one of the data type rules.
// All the data type rules have a return field named value.
Object value = t.getClass().getField("value").get(t);
if (value instanceof String) {
String str = (String) value;
if (depth + str.length() > 0) {
buf.append(prettyPrintTree(str));
} else {
buf.append(str);
}
} else {
72
buf.append(t.getClass().getField("value").get(t));
}
} else if (t.getChildCount() > 0) {
for (int i = 0; i < t.getChildCount(); i++) {
buf.append('\n');
for (int j = 0; j < depth; j++) {
buf.append(" ");
}
buf.append(prettyPrintTree(t.getChild(i), ruleNames));
}
}
buf.append(")");
depth--;
return buf.toString();
}
public static String prettyPrintTree(String str) throws Exception {
StringBuilder buf = new StringBuilder();
int subLength = 70 - depth;
do {
buf.append('\n');
for (int j = 0; j < depth; j++) {
buf.append(" ");
}
if (str.length() > subLength) {
buf.append(str.substring(0, subLength));
str = str.substring(subLength + 1);
} else {
buf.append(str);
}
} while (depth + str.length() > 70);
return buf.toString();
}