2012 eHumanities Amsterdam - Descartes text conversion: lessons learned

Post on 26-Jun-2015

DESCRIPTION

The arduous process of producing a digital text of Descartes' letters, including mathematical formulas. It was a subtask of the CKCC project at the Huygens Institute. Lessons learned. With Erik-Jan Bos, Utrecht.

TRANSCRIPT

Letters from Descartes in digital format

An exercise in conversion

Dirk Roorda @ eHumanities, 2012-01-26

overview

the task

the method

the lessons

the result

◦ demo

The task: converting from ... JapAM

Descartes Correspondence

ca. 700 letters

69,237 lines

600 formulas

4.2 MB (without the 311 pictures)

The task: converting to ... CKCC corpus Descartes

XML : Text Encoding Initiative (TEI)

~ 35,000 elements, of which:

7,200 metadata

7,700 paragraphs

6,200 formulas

6,000 text-formattings

4,200 structure

2,900 page-breaks

538 images

The (re)Sources

EJB (Erik-Jan Bos)

Metadata

Google Books

EJB's head

observation

non-algorithmic changes

consolidation

proofs

The method

use digital equipment:

- your text editor

- your scripting language

- your regular expressions

Observation

observation: italic scopes

replace =(.*?)$

by <italic>match1</italic>

???

Aargh!#@\€]
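The replacement rule on this slide can be tried out in a few lines. The sketch below is in Python, not the Perl of the original conversion program, and it assumes that "=" opens an italic run in the source text, which is only what the pattern on the slide suggests. The function name is hypothetical.

```python
import re

# Assumption: "=" opens an italic run in the source text.
NAIVE_ITALIC = re.compile(r"=(.*?)$")

def naive_italics(line: str) -> str:
    """Wrap everything from the first '=' to the end of the line in <italic>."""
    return NAIVE_ITALIC.sub(r"<italic>\1</italic>", line)

print(naive_italics("see =Principia for details"))
# prints: see <italic>Principia for details</italic>
```

The rule italicises the whole rest of the line, whether or not the italics really end there. That kind of over-reach is presumably what the "???" is about: the scope of the italics cannot be read off from the opening marker alone.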

observation: greek

non-algorithmic changes

closers: hints

consolidating: metadata

... formulas meta closers ...

conversion process

canonical

initial

corrected

improved

checked metadata combining

merging meta

proofs: formulas

proofs: formulas in gif

quick formula checking

The anatomy of conversion

Statistics

convert.pl

100 KB of program code text = 25 densely typed pages = 3427 lines, of which 2175 real code lines

Code/Input = 1/32

1/3 of the tasks need 2/3 of the code:

formulas: (2) 37 %

headers, openers, closers: (3) 16 %

meta and images: (3) 11 %

run time of the same tasks:

formulas: (2) 29 %

headers, openers, closers: (3) 6 %

meta and images: (3) 10 %

total run time (25): 40 sec
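The ratios on this slide can be checked against the figures given earlier in the talk. The Python sketch below only redoes the arithmetic. It assumes that the bracketed numbers are subtask counts, that "Code/Input" compares real code lines with input lines, and that (25) is the total number of subtasks. The names are hypothetical.

```python
# Figures copied from the slides; reading the bracketed numbers as subtask
# counts is an assumption.
INPUT_LINES = 69_237
CODE_LINES = 2_175
TOTAL_TASKS = 25
TOTAL_RUN_SECONDS = 40

# task group: (subtasks, % of code, % of run time)
HEAVY_TASKS = {
    "formulas": (2, 37, 29),
    "headers, openers, closers": (3, 16, 6),
    "meta and images": (3, 11, 10),
}

def summarize() -> dict:
    """Add up the shares of the three heavy task groups."""
    subtasks = sum(n for n, _, _ in HEAVY_TASKS.values())
    code_pct = sum(c for _, c, _ in HEAVY_TASKS.values())
    run_pct = sum(r for _, _, r in HEAVY_TASKS.values())
    return {
        "input_lines_per_code_line": round(INPUT_LINES / CODE_LINES),
        "task_share": subtasks / TOTAL_TASKS,
        "code_pct": code_pct,
        "run_pct": run_pct,
        "run_seconds": TOTAL_RUN_SECONDS * run_pct / 100,
    }

print(summarize())
```

Under these assumptions, 8 of the 25 subtasks (about 1/3) account for 64 % of the code (about 2/3) and 45 % of the run time, and one line of code handles about 32 lines of input.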

The tricks of conversion

1. Unicode is your friend

2. Split into many subtasks

3. task = configuration + workflow

4. Count and check

5. Performance matters

6. Do not give up automation

1. Unicode is your friend
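One way to put Unicode to work during the observation phase is to list every non-ASCII character in the source together with its Unicode name, so that Greek letters and stray symbols show up explicitly. This is a Python sketch using the standard unicodedata module. It is an illustration of the idea, not the method used in the original program, and the function name is hypothetical.

```python
import unicodedata
from collections import Counter

def char_inventory(text: str) -> dict:
    """Count the non-ASCII characters in text, keyed by Unicode name."""
    counts = Counter(ch for ch in text if ord(ch) > 127)
    # Characters without a Unicode name are reported as "UNNAMED".
    return {unicodedata.name(ch, "UNNAMED"): n for ch, n in counts.items()}

# \u03b1 is Greek alpha, \u03b2 is Greek beta
print(char_inventory("a \u03b1\u03b2\u03b1 b"))
# prints: {'GREEK SMALL LETTER ALPHA': 2, 'GREEK SMALL LETTER BETA': 1}
```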

2. Split into many subtasks

(2a) that can be run separately

(2b) that can be reordered easily
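A minimal sketch of what such a split could look like, in Python: every subtask is a function from text to text, registered under a name, so that a run is nothing more than a list of names. The subtask names and the "[pb]" marker are hypothetical and do not come from the original program.

```python
from typing import Callable, Dict, List

Subtask = Callable[[str], str]

def strip_trailing_space(text: str) -> str:
    """Remove whitespace at the end of every line."""
    return "\n".join(line.rstrip() for line in text.split("\n"))

def mark_page_breaks(text: str) -> str:
    """Turn a (hypothetical) '[pb]' marker into an XML element."""
    return text.replace("[pb]", "<pb/>")

SUBTASKS: Dict[str, Subtask] = {
    "strip_trailing_space": strip_trailing_space,
    "mark_page_breaks": mark_page_breaks,
}

def run(text: str, order: List[str]) -> str:
    """Apply the named subtasks in the given order."""
    for name in order:
        text = SUBTASKS[name](text)
    return text

print(run("a  \n[pb]", ["strip_trailing_space", "mark_page_breaks"]))
```

Running one subtask on its own means passing a list with a single name. Reordering means changing the list, not the code.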

3. task = config + workflow
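Read this way, a new subtask needs only a new piece of configuration, while the workflow code stays the same. The Python sketch below uses one generic workflow that applies an ordered list of regular-expression rules. The two rule sets and the "[pb 12]" notation are hypothetical examples, not taken from the original program.

```python
import re
from typing import List, Tuple

Rule = Tuple[str, str]  # (regular expression, replacement)

def apply_rules(text: str, rules: List[Rule]) -> str:
    """The workflow: apply each configured rule in order."""
    for pattern, replacement in rules:
        text = re.sub(pattern, replacement, text, flags=re.MULTILINE)
    return text

# The configuration: one rule list per subtask (hypothetical examples).
PAGE_BREAK_RULES: List[Rule] = [(r"\[pb (\d+)\]", r'<pb n="\1"/>')]
WHITESPACE_RULES: List[Rule] = [(r"[ \t]+$", "")]

print(apply_rules("text [pb 12] more", PAGE_BREAK_RULES))
# prints: text <pb n="12"/> more
```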

4. Count and check (ad nauseam)
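A simple form of counting and checking is to count a marker in the source and the corresponding element in the result, and to stop as soon as the two numbers differ. A Python sketch, with hypothetical function names and a hypothetical "[pb]" marker:

```python
import re

def count(pattern: str, text: str) -> int:
    """Number of non-overlapping matches of pattern in text."""
    return len(re.findall(pattern, text))

def check_counts(source: str, result: str,
                 source_pattern: str, result_pattern: str) -> int:
    """Return the shared count, or raise if source and result disagree."""
    before = count(source_pattern, source)
    after = count(result_pattern, result)
    if before != after:
        raise ValueError(f"{before} in source, but {after} in result")
    return before

print(check_counts("[pb] x [pb]", "<pb/> x <pb/>", r"\[pb\]", r"<pb/>"))
# prints: 2
```

Running such a check after every subtask makes a lost or doubled element visible in the run where it happens, not only in the final proofs.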

5. Performance matters!

was 30+ seconds

is now 2.07 seconds

many new subtasks based on the same template

(gain = 15 * 30 sec = 7.5 min per run)

many, many runs before everything is OK

(gain = 100 * 7.5 min = 12.5 hours CPU-time)

6. Do not give up automation

we used a lot of expert knowledge

which has all been transferred to

- the source

- consolidated extra inputs

so the conversion is still repeatable and modifiable
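One way to keep expert corrections inside the automated process is to store them as data that the program applies on every run, so that nobody has to edit the output by hand. The Python sketch below assumes a hypothetical correction format (line number, old text, new text). The format of the real correction and hint files is not shown on the slides.

```python
from typing import List, Tuple

Correction = Tuple[int, str, str]  # (1-based line number, old text, new text)

def apply_corrections(lines: List[str],
                      corrections: List[Correction]) -> List[str]:
    """Apply each correction; fail loudly if its target is no longer there."""
    fixed = list(lines)
    for line_no, old, new in corrections:
        current = fixed[line_no - 1]
        if old not in current:
            raise ValueError(f"line {line_no}: {old!r} not found")
        fixed[line_no - 1] = current.replace(old, new, 1)
    return fixed

print(apply_corrections(["first line", "teh second line"], [(2, "teh", "the")]))
# prints: ['first line', 'the second line']
```

Because the corrections live outside the output, the whole conversion can be rerun after a change to the program, and a correction that no longer matches is reported instead of being silently lost.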

source formulas meta closers results

corrections hints hints hints CKCC

conversion program

Thank You
