
Word-to-LaTeX – specification

Michal Kebrt

April 19, 2005

Contents

1 Introduction
1.1 Text processors
1.2 TeX and LaTeX
1.3 Word vs. LaTeX

2 Word to LaTeX conversion
2.1 Internal conversion
2.2 External conversion
2.3 What to expect

3 Word-to-LaTeX
3.1 Introduction
3.2 Figures
3.3 Mathematical equations
3.4 Structural parts of a document
3.5 Formatting
3.6 Interaction with LaTeX
3.7 Output options
3.8 Miscellaneous options and features
3.9 Program settings
3.10 Libraries
3.11 Future improvements

4 Conversion programs
4.1 Word2TeX
4.2 RTF to LaTeX convertors
4.3 Other convertors

1 Introduction

Word-to-LaTeX will be a program for converting documents written in Microsoft Word into LaTeX format. The program will be written as a software project (PRG033) at Charles University, Faculty of Mathematics and Physics.


1.1 Text processors

Microsoft Word, WordPerfect, OpenOffice.org Writer and many more are examples of word processors and so-called WYSIWYG¹ text editors. These programs enable a user to create documents and their design and layout interactively by selecting from a wide variety of commands in the program menu. The user always sees the document in its final form – all the document formatting is displayed on the screen (for example, a heading appears in a bold and bigger font).

1.2 TeX and LaTeX

LaTeX is a typesetting system used to produce scientific and mathematical documents of high typographic quality. It is also suitable for many other kinds of documents, from plain letters to large books. LaTeX is also a standard format for submitting manuscripts to many (scientific) conferences.

The main difference between LaTeX and Word is that when you write a document in LaTeX you usually have to type all the text and commands (for example, the \textbf{foo} command makes the text ’foo’ bold) directly into a plain text file, and you can’t see the final look of the document until you run a program which generates a PostScript or PDF file.

LaTeX uses TeX [1] – a computer program for typesetting developed by Professor Donald E. Knuth.

1.3 Word vs. LaTeX

It’s not easy to say which is better – Word, as an example of a word processor, or LaTeX. Everybody needs and likes a different approach to writing documents. Here are some advantages of LaTeX and Word.

LaTeX advantages:

• Users can use predefined document templates (e.g. for articles and books) with a professional look.

• Excellent support for writing mathematical expressions.

• It’s not necessary to specify the document formatting and look; these depend on the document style. The user writes only commands that determine the logical structure of the document (e.g. sections and footnotes).

• Many add-ons (e.g. for inserting graphics or hyperlinks).

• Wide portability of the TeX and LaTeX systems.

• It’s free.

Word advantages:

• Easy to learn and use for most people.

• The user always sees the document in its final form.

¹ what you see is what you get


2 Word to LaTeX conversion

There are two possible ways to convert Word documents to LaTeX format. Much of the information here, as well as the terms ’internal conversion’ and ’external conversion’, comes from the article [3].

2.1 Internal conversion

Internal conversion is carried out within the Word program using its object model. It doesn’t matter whether you use a VBA macro or some external program. The most important thing is that all document parts and all document information, including formatting, Word application settings, etc., are accessible and usable.

An example of a program using internal conversion is Word2TeX [4].

2.2 External conversion

External conversion is performed without the Word application by an external program. There are at least two ways to externally convert a Word document into LaTeX – either access the Word document directly as a binary file, or save the document in a more accessible format (typically RTF) and then convert it into LaTeX. External conversion has one big disadvantage compared with internal conversion: it’s usually impossible to retrieve all the document information, especially about the logical structure of the document.

The first method is completely independent of a Word installation, so it can be performed outside the Windows environment. Although parsing the Word binary format sounds hardly feasible, there are a few programs that use this method – e.g. word2x [5] or Antiword [8].

Programs that convert RTF into LaTeX include rtf2latex2e [6] and w2latex [7].

2.3 What to expect

It’s not possible to perform a 1:1 conversion, as Word and LaTeX are very different document preparation systems. The most important task is surely to convert all the text content – in particular, to translate special characters correctly (e.g. →, σ, etc., or % to \%). The better the Word document is structured and formatted, the better the results conversion programs will generate. This is the reason why users should use paragraph styles and the appropriate Word functions for inserting footnotes, bibliography, index, etc. If users follow these rules, conversion programs can properly convert almost every part of a document.
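The translation of special characters can be illustrated with a small Python sketch. It is not part of the specification: the function name `escape_latex` and the table `LATEX_ESCAPES` are hypothetical, and the table covers only the standard LaTeX special characters plus the two symbols mentioned above.

```python
# Hypothetical mapping table; a real convertor would let the user extend it.
LATEX_ESCAPES = {
    "\\": r"\textbackslash{}",
    "%": r"\%",
    "&": r"\&",
    "$": r"\$",
    "#": r"\#",
    "_": r"\_",
    "{": r"\{",
    "}": r"\}",
    "~": r"\textasciitilde{}",
    "^": r"\textasciicircum{}",
    "→": r"$\rightarrow$",
    "σ": r"$\sigma$",
}


def escape_latex(text: str) -> str:
    """Translate the text one character at a time.

    Each input character is looked up exactly once, so the backslashes
    produced by one replacement are never escaped again.
    """
    return "".join(LATEX_ESCAPES.get(ch, ch) for ch in text)
```

Translating character by character, rather than chaining string replacements, avoids the classic bug where the backslash inserted for `%` is itself escaped by a later rule.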

Another important question is how to convert figures (including embedded ActiveX objects) and mathematical equations. This issue is not easy and is described in detail in the next section.

3 Word-to-LaTeX

3.1 Introduction

Word-to-LaTeX will perform so-called internal conversion, since it will use the Word Object Model to access all the document parts and information. Microsoft Visual Studio 2003 and the C# language were chosen as the development environment. Word-to-LaTeX will run only on Windows with Microsoft Word installed.

The following sections describe the Word-to-LaTeX features and the options that can be set.

3.2 Figures

Figures are among the most important things to convert – images, ActiveX controls (e.g. Microsoft Excel graphs), automatic shapes and so on. Word-to-LaTeX will support two different kinds of figure conversion – as an EPS image (containing PostScript commands) or as an image in its original format (a JPEG photo will be exported as a JPEG file, an Excel graph or automatic shape as a GIF file). The user will have to choose one type of conversion, which will be used for the whole document.

Word-to-LaTeX will have an option to export only figures (not text, lists, etc.), so users can first save all figures as EPS, then as raster images, and finally choose what’s better for each figure.

Conversion to EPS format will be performed by an external PostScript printer driver (e.g. Apple LaserJet II) which can be easily installed on Windows. The conversion procedure is rather cumbersome – the figure will first be copied into the clipboard, then pasted into a temporary Word document, which will be printed into an EPS file using the PostScript printer driver. Once this is done, the Bounding Box property specifying the picture size must be edited to match the picture size in the Word document. The Unix command-line program ps2eps [10] can edit the Bounding Box property automatically, but on Windows it requires a few dependencies, so I will edit this property without any external program – this means changing four numbers in the header of an EPS file, which is a plain ASCII text file.
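The Bounding Box edit described above could look roughly like the following Python sketch. The function name `set_bounding_box` is hypothetical, and the sketch assumes that the figure's lower-left corner lies at the origin, which need not hold for real printer-driver output. The four numbers of the `%%BoundingBox:` comment are given in PostScript points (1/72 inch).

```python
import re

# The DSC comment "%%BoundingBox: llx lly urx ury" sits in the EPS header.
# [^\r\n]* keeps a trailing carriage return intact in CRLF files.
_BBOX = re.compile(r"^%%BoundingBox:[^\r\n]*", re.MULTILINE)


def set_bounding_box(eps_text: str, width_pt: int, height_pt: int) -> str:
    """Replace the first %%BoundingBox comment with one of the given size.

    Assumption: the picture starts at (0, 0), so the box is
    "0 0 width height"; only the first occurrence is changed.
    """
    new_line = f"%%BoundingBox: 0 0 {width_pt} {height_pt}"
    return _BBOX.sub(lambda match: new_line, eps_text, count=1)
```

For example, calling `set_bounding_box(text, 200, 100)` on a file printed for a full page would shrink the declared box to 200 by 100 points while leaving the rest of the file untouched.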

On the other hand, exporting a picture in its original format is quite easy. When the document is saved as a web page, all the figures (including ActiveX objects etc.) are exported as JPEG, PNG or GIF files.

3.3 Mathematical equations

There are three ways to insert mathematical equations into a Word document. The first is the EQ field (Insert – Field), which can be used even for quite complicated expressions containing sums, brackets, matrices, fractions, etc. Expressions prepared with the EQ field must be written in a source code similar to LaTeX (e.g. \f(5;3) for a fraction), but it has some limitations – for example, you can’t create a triple integral. As there is no API for the EQ field, the conversion to LaTeX must be performed by parsing the source code of the expressions.
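To give an idea of what parsing EQ field source code involves, here is a minimal Python sketch that handles only the fraction switch `\f(...)` and copies everything else unchanged. The function names are hypothetical, and the argument separator is assumed to be `;` as in the example above (Word uses the list separator of the system locale, so it may also be a comma).

```python
def eq_to_latex(src: str, sep: str = ";") -> str:
    """Convert EQ-field fractions \\f(num;den) to \\frac{num}{den}.

    Only the \\f switch is handled; all other text is copied as-is.
    Nested fractions are converted recursively.
    """
    out, i = [], 0
    while i < len(src):
        if src.startswith("\\f(", i):
            args, i = _read_args(src, i + 3, sep)
            if len(args) != 2:
                raise ValueError("\\f expects exactly two arguments")
            num, den = (eq_to_latex(arg, sep) for arg in args)
            out.append("\\frac{%s}{%s}" % (num, den))
        else:
            out.append(src[i])
            i += 1
    return "".join(out)


def _read_args(src: str, i: int, sep: str):
    """Split an argument list at top-level separators.

    `i` points just past the opening parenthesis. Returns the list of
    arguments and the index just past the matching closing parenthesis.
    """
    args, current, depth = [], [], 0
    while i < len(src):
        ch = src[i]
        if ch == ")" and depth == 0:
            args.append("".join(current))
            return args, i + 1
        if ch == sep and depth == 0:
            args.append("".join(current))
            current = []
        else:
            if ch == "(":
                depth += 1
            elif ch == ")":
                depth -= 1
            current.append(ch)
        i += 1
    raise ValueError("unbalanced parentheses in EQ field")
```

A full convertor would need one such rule per EQ switch (radicals, integrals, arrays and so on); the sketch only shows the nesting-aware argument splitting that all of them share.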

Equation Editor (typically version 3) is part of the Microsoft Office package. It’s a visual editor without any mode for writing expressions in source code as in the EQ field. Nevertheless, Equation Editor can convert an EQ field expression into its own format, but not back. The only way to convert Equation Editor expressions to LaTeX is to parse their binary format (there’s no public API). Although this format is public [11], I find this method hard to imagine in practice.

MathType [12] is a professional (and commercial) version of Equation Editor with some great improvements – numbered equations, automatic recognition of variables, functions and constants, export to GIF or EPS, and conversion to MathML or LaTeX. MathType also has an API for basic work with expressions (setting conversion translators, converting and saving expressions, etc.). As MathType can work with both Equation Editor and EQ field expressions, this API makes it possible to convert all the expressions within a Word document to LaTeX. However, the MathType SDK Derivative Works Distribution License Agreement [13] states that programs using the SDK must be distributed free of charge and only within the programmer’s company or faculty.

So, which mathematical expressions will Word-to-LaTeX convert? There are two possibilities – either I’ll write a convertor for EQ field expressions and convert all the other mathematical expressions as figures (typically EPS), or I’ll use the MathType SDK to convert all the expressions to LaTeX.

3.4 Structural parts of a document

• footnotes

• cross-references (to headings, figures, etc.)

• titles (for figures, tables, etc.)

• hyperlinks (will be converted to footnotes or references)

• tables

• lists (ordered and unordered)

• headings (the built-in Word paragraph styles Heading x will be converted to the appropriate section commands by default, e.g. Heading 1 to \section; this mapping can be changed)

• index (inserted by the XE field)

• table of contents (not converted; LaTeX generates the table of contents automatically with the \tableofcontents command)

• table of figures

• references (inserted by the TA field)
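The changeable heading mapping from the list above might be modelled as a simple lookup table, as in this Python sketch. Only the pair Heading 1 and \section comes from the specification; the other entries and all names in the code are assumptions made for illustration.

```python
# Assumed defaults: only "Heading 1" -> \section is stated in the text.
DEFAULT_HEADING_MAP = {
    "Heading 1": r"\section",
    "Heading 2": r"\subsection",
    "Heading 3": r"\subsubsection",
}


def heading_to_latex(style, title, mapping=None):
    """Return the sectioning command for a paragraph style.

    Returns None when the style has no mapping, so the caller can fall
    back to treating the paragraph as ordinary text.
    """
    mapping = DEFAULT_HEADING_MAP if mapping is None else mapping
    command = mapping.get(style)
    return None if command is None else f"{command}{{{title}}}"
```

Passing a different dictionary, for example one that maps Heading 1 to `\chapter` for the book class, changes the output without touching the conversion code.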

3.5 Formatting

• the user can define a mapping of paragraph and character styles to LaTeX commands (e.g. the style ’preformated’ to \verb); optionally, a special environment can be created for each style

• text in italics, bold and other styles will be converted to the appropriate LaTeX commands (e.g. \emph, \textbf); the default mapping can be changed

• the font size can’t be set exactly in LaTeX, so there’ll be a point range for each command (e.g. 8–10 for \small); the default ranges can be changed

• page breaks

• paragraph indenting and aligning +²

• text boxes +

² + marks features which Word2TeX [4] (the best Word to LaTeX convertor) doesn’t have
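The point ranges for font size commands mentioned in the list above can be sketched in Python as follows. Only the range 8–10 for \small is taken from the specification; the remaining ranges, the half-open interval convention, the fallback command and all names are assumptions.

```python
# (low, high, command); a size matches when low <= size < high.
# Only the 8-10 range for \small comes from the specification.
SIZE_RANGES = [
    (0, 8, r"\footnotesize"),
    (8, 10, r"\small"),
    (10, 12, r"\normalsize"),
    (12, 14, r"\large"),
]


def size_command(points, ranges=SIZE_RANGES, fallback=r"\Large"):
    """Pick the LaTeX size command whose point range contains the size."""
    for low, high, command in ranges:
        if low <= points < high:
            return command
    return fallback
```

Because the ranges are plain data, letting the user change the defaults amounts to loading a different list.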


3.6 Interaction with LaTeX

• editable document preamble; macros like %Author, %Title used in the preamble will be replaced with the respective information from the Word document

• [LATEX:cmd1]...[\\LATEX:cmd2] like commands can be used in a Word document, for example [LATEX:\textbf{]foo[\\LATEX:}] results in bold ’foo’ +
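The replacement of preamble macros such as %Author and %Title can be illustrated with a short Python sketch. The function name `fill_preamble` and the idea of passing the document information as a dictionary are assumptions; the specification names only the two macros.

```python
import re


def fill_preamble(preamble: str, properties: dict) -> str:
    """Replace %Name macros with values from the document properties.

    Macros without a matching property are left untouched, so ordinary
    LaTeX comments in the preamble survive the substitution.
    """
    def substitute(match):
        return properties.get(match.group(1), match.group(0))

    return re.sub(r"%([A-Za-z]+)", substitute, preamble)
```

A production version would also escape LaTeX special characters in the inserted values, since a title may well contain `&` or `%`.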

3.7 Output options

• the program will produce a LaTeX2ε output file, but it’ll be designed so that other formats can easily be added

• character set of the output LaTeX file

• symbol for the end of a line (CRLF, LF, CR) in the output file

• wrap paragraphs in the output file after x characters or not
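The three output options above (character set, line ending and wrapping) can be combined in one small Python sketch. The function name `render_output` and its parameters are hypothetical, and separating paragraphs by a blank line is an assumption that simply follows LaTeX's own paragraph convention.

```python
import textwrap


def render_output(paragraphs, width=None, newline="\r\n", encoding="utf-8"):
    """Join paragraphs into the bytes of an output file.

    width=None disables wrapping. Long words are never split, because
    breaking a LaTeX command or URL in the middle would corrupt it.
    """
    if width is not None:
        paragraphs = [
            textwrap.fill(p, width=width, break_long_words=False,
                          break_on_hyphens=False)
            for p in paragraphs
        ]
    text = "\n\n".join(paragraphs) + "\n"
    return text.replace("\n", newline).encode(encoding)
```

Applying the line-ending choice as the very last step keeps the wrapping logic independent of whether the file ends up with CRLF, LF or CR.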

3.8 Miscellaneous options and features

• automatic detection of the page size, or a symbolic setting (A4, etc.) +

• automatic detection of page margins +

• option for setting document class (article, book, etc.) +

• translation of special characters and symbols (the default mapping can be changed)

3.9 Program settings

All the program settings will be saved in an XML file with a public format, so users will be able to edit it and adapt the program behaviour to their needs. It’ll also be possible to configure the program through a dialog. The option for saving and loading the program settings will enable users to create several conversion styles for different kinds of documents and then just select one and use it.
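What such a settings file might contain is sketched below in Python. The specification does not define the XML format, so every element and attribute name here is invented purely for illustration.

```python
import xml.etree.ElementTree as ET

# Invented example format; the real settings schema is not specified.
SAMPLE_SETTINGS = """<settings>
  <output lineEnding="LF" wrap="72"/>
  <headings>
    <map style="Heading 1" command="\\section"/>
    <map style="Heading 2" command="\\subsection"/>
  </headings>
</settings>"""


def load_settings(xml_text):
    """Read the example settings format into a plain dictionary."""
    root = ET.fromstring(xml_text)
    output = root.find("output")
    return {
        "line_ending": output.get("lineEnding"),
        "wrap": int(output.get("wrap")),
        "headings": {m.get("style"): m.get("command")
                     for m in root.iter("map")},
    }
```

Keeping several such files side by side would give the selectable conversion styles described above, one file per kind of document.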

3.10 Libraries

• the MathType SDK will probably be used for converting mathematical equations to LaTeX

• the .NET System.Xml library for parsing and creating files with program settings

• the .NET System.Text.Encoding class for converting between different character sets

3.11 Future improvements

• processing of numbered equations

• better recognition of mathematical expressions in regular text


4 Conversion programs

4.1 Word2TeX

Word2TeX [4] is surely the best Word to LaTeX convertor. It has all the features from the previous section which weren’t marked with +, and a few additional functions:

• output file in several formats (e.g. LaTeX2ε, LaTeX 2.09, AMS-LaTeX)

• converts coloured text using a special package

• converts hyperlinks using the hypertex package

• numbered equations

• inserting extra commands for pdfTeX

• user can define own mapping for mathematical expressions

4.2 RTF to LaTeX convertors

• rtf2latex2e [6] – produces quite nice LaTeX output, processes font styles, footnotes, tables, paragraph styles, Equation Editor 3.0 equations and some figures

• other RTF to LaTeX convertors can be found at CTAN sites [9], but they usually can’t process newer versions of the RTF format

4.3 Other convertors

• word2x [5] – widely portable external convertor

• Antiword [8] – converts only to plain text or PostScript, performs external conversion, widely portable; processes font styles and sizes, footnotes, lists, tables, etc., has problems with figures

• a couple of very old convertors (e.g. Word TeX) can be found at CTAN sites [9]

References

[1] Donald E. Knuth. The TeXbook, Volume A of Computers and Typesetting, Addison-Wesley Publishing Company (1984), ISBN 0-201-13448-9.

[2] Ne příliš stručný úvod do systému LaTeX2ε, http://www.penguin.cz/~kocer

[3] Marion Neubauer. Conversion from WORD/WordPerfect to LaTeX, MAPS 14, 1995, 120-124, http://www.ntg.nl/maps/maps14.html.

[4] Word2TeX, http://www.chikrii.com

[5] word2x, http://word2x.sourceforge.net

[6] rtf2latex2e, http://sourceforge.net/projects/rtf2latex2e


[7] w2latex, http://www.tug.org/utilities/texconv/w2latex.html

[8] Antiword, http://www.winfield.demon.nl

[9] CTAN, ftp://ftp.cstug.cz/pub/tex/CTAN

[10] ps2eps, http://www.telematik.informatik.uni-karlsruhe.de/~bless/ps2eps.html

[11] Equation Editor expressions format, http://www.dessci.com/en/reference/sdk/default.htm#MTEF

[12] MathType, http://www.dessci.com/en/products/mathtype

[13] MathType SDK Derivative Works Distribution License Agreement, http://www.dessci.com/en/support/eula/mathtype/mtderivlic.htm
