Converting Unstructured Docs to XML/DITA/ePub Mark Gross Linda Morone

Post on 18-Oct-2014

DESCRIPTION

DCL's Presentation for LavaCon 2011

TRANSCRIPT

Page 1: Converting Unstructured Docs to XML/DITA/ePub

Converting Unstructured Docs to

XML/DITA/ePub

Mark Gross Linda Morone

Page 2: Converting Unstructured Docs to XML/DITA/ePub

(Confidential) 2

Background of Data Conversion Laboratory

30 years of experience providing electronic document conversion services meeting the needs of technology…today & in the future

• More than 1 billion pages converted to date

• US-based project management team

• Global capabilities

• Transform legacy & future documents

• From any format to any format

• Specialize in complex projects

• Identify redundant data for content reuse

• Employ a proven automated process

• Quality Assurance service is standard in all projects

• Additional services include consulting, composition, transcription & translation

Page 3: Converting Unstructured Docs to XML/DITA/ePub

Serving All Industries

• Publishers

• Government

• Defense

• Life sciences

• Automotive

• Aerospace

• Heavy and Industrial Equipment

• Financial Services

• Manufacturing

• Computing

• Utilities

• Semiconductors

• Telecommunications

Page 4: Converting Unstructured Docs to XML/DITA/ePub

Serving a Broad Client Base

Page 5: Converting Unstructured Docs to XML/DITA/ePub

Converting Legacy Data … Is it Worth the Expense?

• Comply with regulations

• Match industry standards

• Meet customer expectations & needs

• Support internal departments

• Expand into new markets

• Multi-purpose content

Page 6: Converting Unstructured Docs to XML/DITA/ePub

Legacy Conversion: Fact or Fiction

Client’s Perception

• Painful Process

• Complex

• Expensive

• Drain on Resources

Reality

• Expertise & Planning

• QC & Automation

• Guaranteed Results

• Low Costs

Page 7: Converting Unstructured Docs to XML/DITA/ePub

So … Which Format Do You Choose?

NLM and Publishing DTDs

• Support traditional publishing

• Flexible open standard

• Freely available

• Human-readable format

DITA and Module-Based DTDs

• Designed for multi-purposing and content reuse

• Topic based & modular

• Supports

– Multiple variants

– Multiple languages

– Context independent content

ePUB and Rendering-Focused DTDs

• Designed for e-readers & mobile devices

• Freely available

• Open standard

• Adaptable to

– Books

– Documents

– Manuals

– User guides

• Support for print publishing requirements is limited

Page 8: Converting Unstructured Docs to XML/DITA/ePub

The Story with ePub and Rendering-Focused DTDs

• ePub is an emerging standard used for most eReaders

• Mobi is also a large player, proprietary to Amazon Kindle

• ePub is an evolving standard

• ePub is supported differently by different eReaders

• There are no “Silver Bullets”

• eBooks are publications and need care in their production

• Not just novels; a recent DCL survey shows 75% will be using eBooks for complex materials

Page 9: Converting Unstructured Docs to XML/DITA/ePub

Things to Keep in Mind When Converting

• Smaller screen size

• Large tables may not fit

• Not all character sets are supported by all devices

• MathML not currently supported

Page 10: Converting Unstructured Docs to XML/DITA/ePub

Pitfalls of Text Extraction

OCR/Text Extraction

• Special Characters

• Emphasis

• Ligatures

• Hyphens – Soft and Hard
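The ligature and hyphen pitfalls above can be illustrated with a small cleanup pass. This is a minimal sketch, not DCL's process: the function name `clean_extracted_text` and the line-end hyphen rule are assumptions made for illustration, and NFKC normalization also folds other compatibility characters that a real project might want to keep.

```python
import re
import unicodedata

SOFT_HYPHEN = "\u00ad"


def clean_extracted_text(text: str) -> str:
    """Normalize a few common text-extraction artifacts (illustrative only)."""
    # Soft hyphens are invisible line-break hints; they are not content.
    text = text.replace(SOFT_HYPHEN, "")
    # NFKC folds compatibility characters, including the "fi" and "fl"
    # ligatures, into their plain-letter equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Rejoin words split by a hard hyphen at a line end ("conver-\nsion").
    # Assumption: a lowercase letter on both sides means a line-break split.
    # A genuine compound such as "well-\nknown" would be merged by mistake,
    # which is why hyphen handling needs review rather than blind automation.
    text = re.sub(r"(?<=[a-z])-\n(?=[a-z])", "", text)
    return text


if __name__ == "__main__":
    print(clean_extracted_text("\ufb01le con\u00adver-\nsion"))
```

Special characters and emphasis are not covered here; they depend on the fonts and markup of the source.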

Page 11: Converting Unstructured Docs to XML/DITA/ePub

Handling of Objects Mid-Paragraph

Converting exactly per source may lead to problems …

Page 12: Converting Unstructured Docs to XML/DITA/ePub

Math as Images – Changing Font Size Doesn’t Change Images

Page 13: Converting Unstructured Docs to XML/DITA/ePub

Unicode Symbols Will Adjust with the Font Size Change

Page 14: Converting Unstructured Docs to XML/DITA/ePub

Large Tables

Table as Text (searchable but cut off)

Table as Image

Page 15: Converting Unstructured Docs to XML/DITA/ePub

When Layout Matters

Testing Materials

Poetry

Page 16: Converting Unstructured Docs to XML/DITA/ePub

When Layout Matters (cont’d)

Letter

Recipe

Page 17: Converting Unstructured Docs to XML/DITA/ePub

Some Notes on the Kindle

• Designed for reading long documents

• Designed for simplicity

• Has some features that others don’t

• But is also missing some features that others have

• Therefore, the conversion needs to be designed differently

Page 18: Converting Unstructured Docs to XML/DITA/ePub

Glossary Definitions

iPad screenshot Kindle screenshot

Page 19: Converting Unstructured Docs to XML/DITA/ePub

Use of CSS “Float” Style

iPad screenshot Kindle screenshot

Page 20: Converting Unstructured Docs to XML/DITA/ePub

Use of Borders

iPad screenshot Kindle screenshot

Page 21: Converting Unstructured Docs to XML/DITA/ePub

Color/Spanning/Large Tables

iPad screenshot Kindle screenshot

Page 22: Converting Unstructured Docs to XML/DITA/ePub

The Story with NLM and Publishing DTDs

• Well-documented public domain standard.

• Well-tested on a wide variety of materials; designed for complex publishing.

• Originally designed with NIH support for Scientific, Technical, and Medical (STM) publications.

• Extended to be robust for many more uses; widely used in non-STM areas.

• DocBook and PRISM are other standard DTDs, each with its own strengths; all are designed for “print” publications.

Page 23: Converting Unstructured Docs to XML/DITA/ePub

Choosing the Content to Convert

Which content will be auto-generated?

• TOC

• Index

• Labels

• Titles

• Lists of Tables, Figures, etc.

Page 24: Converting Unstructured Docs to XML/DITA/ePub

Capturing Items as Multiple Formats

Math as images and MathML

Tables as images and XHTML

<disp-formula id="FD1">
  <mml:math id="M1" display='block'>
    <mml:semantics>
      <mml:mrow>
        <mml:mi>L</mml:mi>
        <mml:mo>&#x0003D;</mml:mo>
        <mml:mo>&#x02211;</mml:mo>
        <mml:mrow>
          <mml:msub>
            <mml:mrow>
              <mml:mi>l</mml:mi>
            </mml:mrow>
            <mml:mi>i</mml:mi>
          </mml:msub>
          <mml:mo>&#x0002F;</mml:mo>
          <mml:mi>N</mml:mi>
        </mml:mrow>
        <mml:mo>&#x0002E;</mml:mo>
      </mml:mrow>
    </mml:semantics>
  </mml:math>
</disp-formula>
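Once math is captured as MathML, other renditions can be derived from it. The sketch below is an illustration rather than part of the presentation: it parses a copy of the formula above and builds a plain-text fallback, such as might serve as alternative text for an image rendition. The helper name `text_fallback` is hypothetical, and the `xmlns:mml` declaration is added here because a standalone fragment needs it to parse.

```python
import xml.etree.ElementTree as ET

MML = "http://www.w3.org/1998/Math/MathML"

# Copy of the slide's formula, with the namespace declared on the wrapper.
FORMULA = """
<disp-formula id="FD1" xmlns:mml="http://www.w3.org/1998/Math/MathML">
  <mml:math id="M1" display="block">
    <mml:semantics>
      <mml:mrow>
        <mml:mi>L</mml:mi>
        <mml:mo>&#x0003D;</mml:mo>
        <mml:mo>&#x02211;</mml:mo>
        <mml:mrow>
          <mml:msub>
            <mml:mrow><mml:mi>l</mml:mi></mml:mrow>
            <mml:mi>i</mml:mi>
          </mml:msub>
          <mml:mo>&#x0002F;</mml:mo>
          <mml:mi>N</mml:mi>
        </mml:mrow>
        <mml:mo>&#x0002E;</mml:mo>
      </mml:mrow>
    </mml:semantics>
  </mml:math>
</disp-formula>
"""

# Token elements whose text is collected, in document order.
TOKEN_TAGS = {f"{{{MML}}}{name}" for name in ("mi", "mo", "mn")}


def text_fallback(xml_source: str) -> str:
    """Join the MathML token elements into a flat, searchable string."""
    root = ET.fromstring(xml_source.strip())
    tokens = [(el.text or "").strip() for el in root.iter() if el.tag in TOKEN_TAGS]
    return " ".join(tokens)


if __name__ == "__main__":
    print(text_fallback(FORMULA))
```

The flat string loses structure (the subscript becomes "l i"), so it complements the MathML and image renditions rather than replacing them.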

Page 25: Converting Unstructured Docs to XML/DITA/ePub

Determining Data Elements

Content Based:

• <email> - @

• <uri> - www

• <degrees> - PhD, MD, BA

• <fig> - Figure, Illustration, Chart, Scheme

Appearance Based:

• Alignment

• Placement

• Point size

• Font
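The content-based cues listed above can be expressed as simple pattern rules. This is a minimal sketch under stated assumptions: the rule table, the `guess_element` name, and the first-match-wins order are illustrative choices, not the presenters' tooling, and appearance-based cues (alignment, placement, point size, font) would need layout data that plain text does not carry.

```python
import re
from typing import Optional

# Hypothetical content-based rules, checked in order; the first match wins.
RULES = [
    ("email", re.compile(r"\S+@\S+")),
    ("uri", re.compile(r"\bwww\.\S+", re.IGNORECASE)),
    ("degrees", re.compile(r"\b(?:PhD|MD|BA)\b")),
    ("fig", re.compile(r"^(?:Figure|Illustration|Chart|Scheme)\b")),
]


def guess_element(text: str) -> Optional[str]:
    """Return a candidate element name for a text run, or None if no cue fits."""
    for name, pattern in RULES:
        if pattern.search(text):
            return name
    return None


if __name__ == "__main__":
    for sample in ["www.dclab.com", "Jane Doe, PhD", "Figure 3. Backup flow"]:
        print(sample, "->", guess_element(sample))
```

Rules this simple produce false positives (for example, "BA" inside an acronym list), so their output is best treated as a suggestion for review.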

Page 26: Converting Unstructured Docs to XML/DITA/ePub

Granularity of Tagging: Front Matter

Page 27: Converting Unstructured Docs to XML/DITA/ePub

Granularity of Tagging: Back Matter

• Are the references Harvard or Numeric?

• Is the author name last/first or first/last?

• What is the placement of the year within the citation?

• Is a comma or period used after the author names?
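The first and third questions above can be approximated with a heuristic. The sketch below is an assumption-laden illustration: `reference_style`, the regular expressions, and the "year in the first half" test are invented here, and the sample citations are placeholders rather than real references.

```python
import re

# A numeric reference starts with a label such as "[12]", "12." or "12)".
NUMERIC_LABEL = re.compile(r"^\s*(?:\[\d+\]|\d+[.)])\s")
# A four-digit year from 1900 to 2099.
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def reference_style(ref: str) -> str:
    """Classify one citation as 'numeric', 'harvard', or 'unknown'."""
    if NUMERIC_LABEL.match(ref):
        return "numeric"
    year = YEAR.search(ref)
    # Assumption: Harvard (author-date) style puts the year right after the
    # author names, so it should fall in the first half of the citation.
    if year is not None and year.start() < len(ref) / 2:
        return "harvard"
    return "unknown"


if __name__ == "__main__":
    samples = [
        "[1] Author A, Author B. Example title. Example Journal. 2009;4:1-9.",
        "Author, A. (2009). Example title. Example Journal, 4, 1-9.",
        "Author A. Example title. Example Journal 4, 1-9 (2009).",
    ]
    for sample in samples:
        print(reference_style(sample), "|", sample)
```

Author-name order and punctuation would need further rules of the same kind; the "unknown" result marks citations that should go to manual review.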

Page 28: Converting Unstructured Docs to XML/DITA/ePub

The Story with DITA and Module-Based DTDs

• Allows for modularization of your content with Topics, and easy re-use in multiple outputs

• Pre-packaged & ready to use XML (almost)

• Ready-to-go for techdocs (mostly)

• Infrastructure included - taxonomy (DTD and schema); printing stylesheets; lots of tools

• Printable with standard tools

• Extensible with specializations

• Further specializations for publishing, testing, and other specialized areas

• Content-based

• What do you do when things don’t fit?

Page 29: Converting Unstructured Docs to XML/DITA/ePub

What Makes DITA Conversions Difficult

• DITA is a conceptual departure from linear information – and is difficult for many to get used to

• Turns the traditional book into a collection of Topics

• Topics can be thought of as interchangeable parts

– to be reassembled in multiple ways

– to be repurposed for multiple outputs

– to be reused across multiple products

• …but your documents weren’t likely to have been designed to do this.

“Getting there using DITA is like building with prefabricated modular components that can be quickly assembled into a suitable structure.”

- Doug Henschen, intelligententerprise.com

Page 30: Converting Unstructured Docs to XML/DITA/ePub

Structuring a Book into Topics in DITA

[Diagram: Books 1 through 4 are broken into Concept, Task, and Reference topics, stored in a DITA Content Management System (Concepts 1-5, Tasks 1-3, References 1-5), and reassembled into Book A and Book B.]
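Reassembling topics into books is what a DITA map does. The following is a minimal sketch, assuming a hypothetical topic pool and file names; it builds two maps that share one topic. The `build_map` helper and the topic selection are illustrative and do not reproduce the exact contents of the diagram.

```python
import xml.etree.ElementTree as ET

# Hypothetical topic pool: topic id -> file name (names are illustrative).
TOPICS = {
    "concept1": "concept1.dita",
    "task1": "task1.dita",
    "reference1": "reference1.dita",
    "reference2": "reference2.dita",
}


def build_map(title: str, topic_ids: list) -> str:
    """Serialize a DITA map that pulls the given topics in by reference."""
    root = ET.Element("map")
    ET.SubElement(root, "title").text = title
    for topic_id in topic_ids:
        # Each topicref points at a topic file; the topic itself is not copied.
        ET.SubElement(root, "topicref", href=TOPICS[topic_id])
    return ET.tostring(root, encoding="unicode")


book_a = build_map("Book A", ["task1", "reference1", "concept1"])
book_b = build_map("Book B", ["reference2", "task1"])

if __name__ == "__main__":
    print(book_a)
    print(book_b)
```

Because both maps reference `task1.dita`, a correction to that topic reaches both books, which is the reuse benefit the diagram illustrates.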

Page 31: Converting Unstructured Docs to XML/DITA/ePub

Further Complications in DITA Conversions

• There are the usual conversion issues

– Accuracy of the transferred text

– Tables

– Math

– Special Characters

• There are also the structuring issues

– Identifying topics

– Identifying reusable content

• And the people issues

– Deciding what needs re-authoring

– Getting used to a new “document” paradigm

– Getting rugged individualists to collaborate more

Page 32: Converting Unstructured Docs to XML/DITA/ePub

Overview of Typical DITA Technical Conversion Issues

• Architectural constraints of DITA – the square pegs

– Multiple sets of steps within a single task topic

– Task/Procedure authored as a table in the source

– Presence of untitled tasks/topics in the source

– References to page numbers (irrelevant cross-references)

– Having more than two levels of steps

• How your rendering system will handle XML

– Figures

– Steps

• Other conversion considerations:

– Hierarchy in Map Files

– Metadata in Map Files and Topics

– Index Terms

– Conditional Text

– Glossary Terms

– Content Terms

Page 33: Converting Unstructured Docs to XML/DITA/ePub

Square Peg 1 - Task / Procedure Authored As a Table

Issue:

Tasks are done as tables rather than numbered lists. If there’s no clear consistent pattern, then automated conversion keeps the tables as tables, and steps are not tagged as steps.

1 | Overview | In general, backup and recovery refers to the various strategies and procedures involved in protecting a system against data loss.

2 | Backup strategy and frequency | A backup is a copy of key files. Files included in the backup are:

• A logical backup of the database

1. Key system files

• Network files

• Timezone

2. Configuration files …
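Whether a table can be converted to steps depends on finding a consistent pattern. The sketch below shows one such check, assuming the table has already been extracted as (number, action) rows: it emits DITA `<steps>` only when the first column counts 1, 2, 3, and otherwise signals that the table should stay a table. The `rows_to_steps` name and the row format are assumptions for illustration.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple


def rows_to_steps(rows: List[Tuple[str, str]]) -> Optional[ET.Element]:
    """Build a DITA <steps> element from table rows, or return None.

    None means no reliable pattern was found, so the caller should keep
    the source table as a table.
    """
    numbers = [number.strip() for number, _ in rows]
    expected = [str(i) for i in range(1, len(rows) + 1)]
    if not rows or numbers != expected:
        return None
    steps = ET.Element("steps")
    for _, action in rows:
        step = ET.SubElement(steps, "step")
        ET.SubElement(step, "cmd").text = action.strip()
    return steps


if __name__ == "__main__":
    rows = [("1", "Loosen the screws."), ("2", "Disengage the ejectors.")]
    result = rows_to_steps(rows)
    print(ET.tostring(result, encoding="unicode"))
    # Inconsistent numbering: the check declines to convert.
    print(rows_to_steps([("1", "First row"), ("3", "Third row")]))
```

A table like the example above, where a cell mixes bullets and restarted numbering, would need either a richer rule or manual handling.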

Page 34: Converting Unstructured Docs to XML/DITA/ePub

Square Peg 2 - Multiple Steps In A Single Task

Issue:

Only one set of steps is allowed in a single task topic. When a task has two sets of steps within a topic, such as for two different scenarios, only one of the scenarios can be tagged as <steps> as per the DTD.

Example:

Replacing an XYZ Module

Use this procedure to replace an XYZ module.

Remove XYZ Module

1. Loosen the screws.

2. Disengage the ejectors.

3. Pull the module straight out.

Insert Replacement XYZ Module

1. Align the module.

2. Insert the module, pressing in firmly.

3. Engage the ejectors.

4. Securely tighten the screws.
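The slide states the constraint but not a remedy. One possible workaround, offered here as an assumption rather than the presenters' recommendation, is to split each set of steps into its own task topic. The sketch uses a hypothetical `make_task` helper and invented topic ids.

```python
import xml.etree.ElementTree as ET


def make_task(task_id: str, title: str, commands: list) -> ET.Element:
    """Build one DITA task topic containing exactly one <steps> element."""
    task = ET.Element("task", id=task_id)
    ET.SubElement(task, "title").text = title
    body = ET.SubElement(task, "taskbody")
    steps = ET.SubElement(body, "steps")
    for command in commands:
        step = ET.SubElement(steps, "step")
        ET.SubElement(step, "cmd").text = command
    return task


# The two procedures from the example, keyed by their source headings.
PROCEDURES = {
    "Remove XYZ Module": [
        "Loosen the screws.",
        "Disengage the ejectors.",
        "Pull the module straight out.",
    ],
    "Insert Replacement XYZ Module": [
        "Align the module.",
        "Insert the module, pressing in firmly.",
        "Engage the ejectors.",
        "Securely tighten the screws.",
    ],
}

# Topic ids "xyz_task1" and "xyz_task2" are placeholders.
tasks = [
    make_task(f"xyz_task{n}", title, commands)
    for n, (title, commands) in enumerate(PROCEDURES.items(), start=1)
]

if __name__ == "__main__":
    for task in tasks:
        print(ET.tostring(task, encoding="unicode"))
```

Splitting changes the document structure, so a parent topic or map entry would be needed to keep the two tasks together under "Replacing an XYZ Module".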

Page 35: Converting Unstructured Docs to XML/DITA/ePub

Square Peg 3 - Irrelevant Cross-References

Issue:

Conversion to DITA may make some source cross-references irrelevant. For example, assuming all empty chapter headings are dropped, a reference to a chapter is no longer valid. In these cases, a <required-cleanup> tag is inserted to flag these occurrences for clean-up.

See Chapter 1, Introduction on page 2

Would be tagged as:

See <required-cleanup><xref href="chap1">Chapter 1, Introduction</xref></required-cleanup>

NOTE: Hard-keyed page numbers are typically dropped from the cross-reference string since they are no longer relevant in DITA.
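The tagging rule above can be sketched as two regular-expression passes. This is an illustration only: the patterns, the `flag_chapter_refs` name, and the `chapN` href scheme (modeled on the example) are assumptions, and the title pattern is greedy enough that references embedded in longer sentences would need a tighter rule.

```python
import re

# Hard-keyed page references such as " on page 2".
PAGE_REF = re.compile(r"\s+on page \d+")
# Chapter references such as "Chapter 1, Introduction".
CHAPTER_REF = re.compile(r"Chapter (\d+), ([A-Z][\w ]*)")


def flag_chapter_refs(text: str) -> str:
    """Drop page numbers and wrap chapter references for later clean-up."""
    text = PAGE_REF.sub("", text)

    def wrap(match):
        href = f"chap{match.group(1)}"
        return (
            f'<required-cleanup><xref href="{href}">'
            f"{match.group(0)}</xref></required-cleanup>"
        )

    return CHAPTER_REF.sub(wrap, text)


if __name__ == "__main__":
    print(flag_chapter_refs("See Chapter 1, Introduction on page 2"))
```

Flagging rather than deleting keeps the decision with a reviewer, who can retarget the link to a topic or remove it.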

Page 36: Converting Unstructured Docs to XML/DITA/ePub

So … Maybe You Shouldn’t Bother Converting Your Content?

• It seems like such a pain to go through all the old luggage in the attic.

• There is always a need for some rewriting - few writers have the clairvoyance to author content with the intent that it be converted in the future – might as well rewrite it all.

• My writers aren’t very busy right now anyway.

• It’s more fun and seems like less trouble to author anew.

Page 37: Converting Unstructured Docs to XML/DITA/ePub

In Reality … Converting Your Content is Worth the Bother

• Throwing it out and starting over is an expensive option

– In DITA, rewriting at $25/page vs. converting at $3-$4/page

– The hidden costs of redoing index entries, links and other features you’ve built in

– The hidden cost of reviewing, reproofing, and recertifying it all

• It’s usually easier to use what you have as a base, and convert over

– Needs planning

– Needs time

• Planning for a good conversion experience

– Which content will you need?

– Which content is worth converting?

– Which content is suitable for re-use in multiple places?

– What tools are available?

– How to specify the conversion to get it right?

– When do you start all this planning?
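The per-page figures above make the comparison easy to work through. In this sketch the rates come from the slide, while the 10,000-page document set and the function name `rewrite_vs_convert` are hypothetical; the hidden costs listed above are not modeled.

```python
def rewrite_vs_convert(pages, rewrite_rate=25.0, convert_low=3.0, convert_high=4.0):
    """Compare direct costs of rewriting and converting, in dollars."""
    rewrite = pages * rewrite_rate
    low = pages * convert_low
    high = pages * convert_high
    return {
        "rewrite": rewrite,
        "convert_low": low,
        "convert_high": high,
        # Savings range: smallest when conversion costs the most.
        "savings_min": rewrite - high,
        "savings_max": rewrite - low,
    }


if __name__ == "__main__":
    # Hypothetical document set of 10,000 pages.
    for key, value in rewrite_vs_convert(10_000).items():
        print(f"{key}: ${value:,.0f}")
```

For 10,000 pages this gives $250,000 to rewrite against $30,000 to $40,000 to convert, before any of the hidden review and recertification costs are counted.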

Page 38: Converting Unstructured Docs to XML/DITA/ePub

Conversion Scope Options

[Chart: options 1, 2, and 3 plotted on time and cost axes]

Option 1: Convert nothing

• No conversion costs

• Delayed ROI

Option 2: Convert everything

• High conversion costs

• Reduced ROI

Option 3: Convert ‘frequently used’ documents

• Some conversion costs

• Maximized ROI

Page 39: Converting Unstructured Docs to XML/DITA/ePub

What to Convert, and in What Order

• Categorizing

– Active documents in good shape

– Active documents that need a lot of work

– Somewhat inactive documents that will likely be retired

– Archival materials

• Prioritizing

– Documents that are most used

– Documents that are customer favorites

– Documents with longest product life

– Start with most recent documents and go back

• Identifying the process

– Can be converted as is

– Can be converted with some work

– Needs to be rewritten

– Don’t convert – just keep archival copies
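The categorize, prioritize, and identify-the-process steps above can be sketched as a small triage routine. Everything here is hypothetical: the `Doc` fields, the usage metric, and the sample inventory are invented for illustration, and a real inventory would weigh product life and customer preference as well.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Doc:
    name: str
    active: bool          # still in use, not archival
    needs_rewrite: bool   # too much work to convert as is
    monthly_views: int    # hypothetical stand-in for "most used"


def plan(docs: List[Doc]) -> Tuple[List[str], List[str], List[str]]:
    """Return (convert, rewrite, archive) name lists, most-used first."""
    by_use = sorted(docs, key=lambda d: d.monthly_views, reverse=True)
    convert = [d.name for d in by_use if d.active and not d.needs_rewrite]
    rewrite = [d.name for d in by_use if d.active and d.needs_rewrite]
    archive = [d.name for d in by_use if not d.active]
    return convert, rewrite, archive


# Invented sample inventory.
INVENTORY = [
    Doc("install-guide", active=True, needs_rewrite=False, monthly_views=900),
    Doc("legacy-faq", active=False, needs_rewrite=False, monthly_views=5),
    Doc("admin-manual", active=True, needs_rewrite=True, monthly_views=400),
    Doc("user-guide", active=True, needs_rewrite=False, monthly_views=1500),
]

if __name__ == "__main__":
    convert, rewrite, archive = plan(INVENTORY)
    print("Convert:", convert)
    print("Rewrite:", rewrite)
    print("Archive only:", archive)
```

Sorting before splitting means each list is already in the order the work would be scheduled.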

Page 40: Converting Unstructured Docs to XML/DITA/ePub

Closing Thoughts

• Know the scope of what you want to accomplish

– Are you trying to get eBooks quickly, or are you changing your publishing process?

– Are you moving everything, or will a phased approach work?

– Will your content work naturally with the selected DTD?

• Start the conversion process early

– Shifts the critical path; speeds the process; reduces cleanup

– Organizing early lets more of the work be done by the content owners

– Eases the training and change acceptance burdens

– Setting up collaborative teams sets the tone and allows one to “divide and conquer”

• Converting legacy data is not trivial

– …but faster, safer and less expensive than rewriting

– Each DTD has special considerations to be taken into account

– Much can be automated, but it needs planning

Page 41: Converting Unstructured Docs to XML/DITA/ePub

Questions & Answers

Data Conversion Laboratory

61-18 190th St., 2nd Floor

Fresh Meadows, NY 11365

Telephone: (718) 357-8700

Fax: (718) 357-8776

Web: http://www.dclab.com

Mark Gross, President

[email protected]

718-307-5711

Linda Morone, Sr. VP of Sales & Marketing

[email protected]

718-307-5728