historical spelling normalization
2012 (Nov.) Martin Reynaert, Iris Hendrickx, and Rita Marquilhas. «Historical spelling normalization. A comparison of two statistical methods: TICCL and VARD2», Second Workshop on Annotation of Corpora for Research in the Humanities, Universidade de Lisboa, Lisboa.
Background | Data and Methods | Results | Conclusions
Historical spelling normalization. A comparison of two statistical methods: TICCL and VARD2
Martin Reynaert, Iris Hendrickx and Rita Marquilhas
Tilburg University, The Netherlands and Centro de Linguística, Universidade de Lisboa, Portugal
November 29, 2012
ACRH-2 2012 Historical spelling normalisation
Background & Motivation
Aim: Automatic spelling variation reduction in a historical corpus
The goal was to reduce the problem of spelling variation in the Portuguese CARDS-FLY corpus of personal letters written from the 16th to the 20th century.
This corpus aims to provide a digital version of the letters while keeping and recording as much as possible of the original handwritten letters, including all spelling variation. For certain types of research, or for querying the corpus, this variation can prevent good results.
Overview
Origin of the personal letters
Introduction
How does the corpus look?
What data set did we use for the experiments?
Description of the two tools: VARD2 and TICCL
Results
Discussion
The corpus | VARD2 | TICCL
The corpus
The CARDS-FLY corpus¹ is ongoing work [Marquilhas, 2012] and aims to collect a total of 4000 personal letters. Currently 3455 letters have been transcribed. The letters are manually transcribed into an electronic XML-TEI file format including rich and detailed historical and sociological meta-data.
Origin of the personal letters
1500-1800: from religious legal proceedings, as evidence used by the Inquisition,
19th C: legal evidence, in criminal cases heard by the Portuguese Royal Appeal Court,
20th C: soldiers who fought in World War I or in the Portuguese Colonial War, political prisoners and emigrants.
¹ CARDS-FLY corpus: http://alfclul.clul.ul.pt/cards-fly/
Letter from Margarida to her lover, Jose, 1778
Ciphered words are in the Masonic - or Pigpen - code and they refer to religious and Inquisition concepts
Manual transcription of the letter in XML (TEI v.5)
Figure: Full description at: http://alfclul.clul.ul.pt/cards-fly
Aim: Spelling normalization of the transcription
Figure: English translation: I have more than once asked Your Honour and begged Your Honour to leave me alone. But Your Honour has insisted on defying me, dishonouring me, lessening me, engaging in gossip about me at every corner, both by words spoken and by letters written to whoever you choose. I remind you, speaking as a friend...
Data set
For these experiments
Random subset of 200 letters from the CARDS-FLY corpus.
Tokenised, and names are converted to string ‘NAME’.
Normalisation and POS manually verified by a linguist.
This data set was split into 100 letters for training the tools, and 100 for the evaluation set.
Evaluation scores are computed with recall, precision and F-score.
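A minimal sketch of how such token-level scores could be computed. The counting convention (a "change" is any token whose output differs from the original; a correct change matches the manually verified form) is an assumption, and the function name and toy tokens are illustrative rather than taken from the study.

```python
def evaluate(originals, gold, predicted):
    """Token-level precision, recall and F-score for spelling normalisation.

    Assumed convention: a token counts as changed when it differs from
    the original token; a change is correct when it equals the gold form.
    """
    changed_by_tool = 0   # tokens the tool altered
    needing_change = 0    # tokens the gold standard alters
    correct_changes = 0   # tokens the tool altered into the gold form
    for orig, ref, hyp in zip(originals, gold, predicted):
        if ref != orig:
            needing_change += 1
        if hyp != orig:
            changed_by_tool += 1
            if hyp == ref:
                correct_changes += 1
    precision = correct_changes / changed_by_tool if changed_by_tool else 0.0
    recall = correct_changes / needing_change if needing_change else 0.0
    total = precision + recall
    f_score = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_score


# Toy tokens for illustration only (not corpus data).
originals = ["hum", "dia", "muyto", "bom", "NAME"]
gold      = ["um",  "dia", "muito", "bom", "NAME"]
predicted = ["um",  "dia", "muyto", "boa", "NAME"]
precision, recall, f_score = evaluate(originals, gold, predicted)
```

In the toy run the tool makes two changes, one of them correct, and the gold standard requires two changes, so all three scores come out at 0.5.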
Statistics
Table: Statistics for the evaluation set of 100 letters, divided into the four time periods. ‘#Tok/file’ shows the average number of tokens per letter, ‘#Norm/file’ the average number of manual spelling corrections per letter and ‘%Norm/tok’ is the percentage of all tokens that is normalised.

Period      Files    Tok   #Tok/file  #Norm/file  %Norm/tok
1500-1700      10   2262       226.2        56.9       25.2
1701-1800      28  13913       496.9       120.8       24.3
1801-1930      43  14343       333.6        60.7       18.1
1931-1974      19   6817       358.8        16.1        4.2
VARD2 normalization tool
VARD2 was developed for historical English and works as follows:
VARD2 is first trained on data to set the parameters of the tool.
Each word is checked against a modern lexicon.
Unknown words are potential spelling variants.
For each variant, generate candidate modern counterparts using the HDBP² variant list, character rewrite rules and a Soundex algorithm to find phonetically similar counterparts.
Each candidate gets a confidence weight.
If above threshold, candidate replaces variant.
We replaced the English resources with Portuguese ones, re-using several existing resources.
² Historical Dictionary of Brazilian Portuguese
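The steps listed above can be sketched roughly as follows. This is not VARD2's code: the lexicon, variant list, rewrite rules, weights and threshold are invented toy values, and `phonetic_key` is a deliberately crude stand-in for a real Soundex implementation.

```python
# Illustrative sketch only; all data and weights below are hypothetical.
MODERN_LEXICON = {"um", "muito", "ele"}
KNOWN_VARIANTS = {"hum": "um"}                 # stand-in for a variant list
REWRITE_RULES = [("y", "i"), ("ph", "f"), ("ll", "l")]
THRESHOLD = 0.5                                # assumed replacement threshold


def phonetic_key(word):
    """Crude phonetic key (NOT Soundex): first letter plus later consonants,
    skipping vowels, 'y', 'h' and immediately repeated letters."""
    key = word[0]
    for ch in word[1:]:
        if ch not in "aeiouyh" and ch != key[-1]:
            key += ch
    return key


def candidates(word):
    """Collect candidate modern forms with a confidence weight each."""
    scored = {}
    if word in KNOWN_VARIANTS:                 # source 1: known variant list
        scored[KNOWN_VARIANTS[word]] = 1.0
    for old, new in REWRITE_RULES:             # source 2: rewrite rules
        rewritten = word.replace(old, new)
        if rewritten != word and rewritten in MODERN_LEXICON:
            scored[rewritten] = max(scored.get(rewritten, 0.0), 0.8)
    for entry in MODERN_LEXICON:               # source 3: phonetic match
        if phonetic_key(entry) == phonetic_key(word):
            scored[entry] = max(scored.get(entry, 0.0), 0.4)
    return scored


def normalise(word):
    """Replace a variant by its best candidate if the weight is high enough."""
    if word in MODERN_LEXICON:                 # known modern word: keep
        return word
    scored = candidates(word)
    if not scored:
        return word
    best, weight = max(scored.items(), key=lambda item: item[1])
    return best if weight >= THRESHOLD else word
```

With these toy settings `normalise("muyto")` yields `"muito"` via the rewrite rule, while `normalise("mota")` is left unchanged because its only candidate comes from the phonetic match and falls below the threshold.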
Text-Induced Corpus Clean-up: Introduction
TICCL for TYPOS and OCR-errors
Tool to perform large-scale, unsupervised spelling correction of corpora.
Spelling correction = reduction of lexical variation caused by typos, OCR-errors, historical orthographical changes...
Prototype developed during a pilot project by invitation of the National Library, The Hague.
Production version developed according to KB specifications, second half 2008.
Development continues, Open Source release soon.
To be made multilingual; first paper on Portuguese presented here.
TEXT-INDUCED CORPUS CLEAN-UP: BASIC RETRIEVAL MECHANISM
Represent identical bags of characters (i.e. word strings sharing the same bag of characters) by an identifying numerical value,
Use this value as the index key to the word strings in a database
Perform simple calculations to retrieve variants from the database.
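The mechanism can be illustrated with an in-memory dictionary standing in for the database. This is a sketch under assumptions: the key function (sum of character code points raised to the fifth power) and the helper names are chosen for illustration and are not TICCL's actual code.

```python
from collections import defaultdict


def anagram_key(word, power=5):
    """Numerical value identifying the bag of characters of `word`."""
    return sum(ord(ch) ** power for ch in word)


def build_index(words):
    """Map each key to the set of word strings sharing that bag of characters."""
    index = defaultdict(set)
    for word in words:
        index[anagram_key(word)].add(word)
    return index


index = build_index(["act", "cat", "car", "tact", "at"])
# Look up every stored word built from the characters of "tac".
same_bag = sorted(index.get(anagram_key("tac"), set()))
```

Because the key depends only on which characters occur and how often, the lookup for "tac" returns "act" and "cat" without comparing any strings.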
ANAGRAM HASHING
$$\mathrm{Key}(w) = \sum_{i=1}^{|w|} f(c_i)^{n}$$
A bad hashing function: produces collisions.
Lines up ANAGRAMS: strings consisting of the same bag of characters.
In practice, we use the code value of each character in the string raised to the power 5.
Values obtained for the string are summed.
ANAGRAM HASHING II
CAT = anagram of ACT and TAC
A + C + T = 65^5 + 67^5 + 84^5 = 6,692,535,156
C + A + T = 67^5 + 65^5 + 84^5 = 6,692,535,156
ALLOWS FOR ADDITION AND SUBTRACTION
SAME APPLIES FOR WORD COMBINATIONS, PHRASES, SENTENCES...
Great for discovering anagrams: citric critic, cosmic comics, pentatonische pistachenoten
BASIS FOR TISC: Text-Induced Spelling Correction
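The values above can be checked with a few lines of Python. The sketch assumes that the code value of a character is its code point and that whitespace is skipped when a phrase is keyed; the second point is an assumption made here for illustration.

```python
def anagram_key(text, power=5):
    """Sum of character code points raised to `power`; whitespace is skipped."""
    return sum(ord(ch) ** power for ch in text if not ch.isspace())


cat_value = anagram_key("CAT")  # same bag of characters as ACT and TAC

# Keys are plain sums, so the key of a phrase equals the sum of its words' keys.
phrase_value = anagram_key("pentatonische pistachenoten")
parts_value = anagram_key("pentatonische") + anagram_key("pistachenoten")
```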
ANAGRAM HASHING III
Given ANAGRAM VALUE (AV): 6,692,535,156
AV(ACT) + 84^5 (plus T) = TACT
AV(ACT) - 67^5 (minus C) = AT, TA
AV(ACT) - 84^5 + 82^5 (minus T, plus R) = CAR
AV(ACT) - 84^5 + 78^5 + 83^5 (minus T, plus N, plus S) = CANS/SCAN
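The same arithmetic, written as a small lookup over a toy lexicon. The dictionary index and variable names are illustrative assumptions; only the key function and the additions and subtractions follow the slide.

```python
from collections import defaultdict


def anagram_key(word, power=5):
    """Sum of character code points raised to `power`."""
    return sum(ord(ch) ** power for ch in word)


def value(char):
    """Contribution of a single character to an anagram value."""
    return ord(char) ** 5


lexicon = ["ACT", "CAT", "TACT", "AT", "TA", "CAR", "CANS", "SCAN"]
index = defaultdict(list)
for word in lexicon:
    index[anagram_key(word)].append(word)

av = anagram_key("ACT")
insertion = index.get(av + value("T"), [])                      # plus T
deletion = index.get(av - value("C"), [])                       # minus C
substitution = index.get(av - value("T") + value("R"), [])      # minus T, plus R
two_edits = index.get(av - value("T") + value("N") + value("S"), [])
```

Each lookup is a single addition or subtraction followed by a dictionary access, so no string comparison against the whole lexicon is needed.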
Focus word approach: take a word and systematically search for its variants, then take the next word..., etc.
OR:
Character Confusion approach: systematically search for all word pairs in the corpus that display a particular difference in characters for all possible confusions given a particular edit distance
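A sketch of the character confusion approach for a single confusion (here "y" written where modern spelling has "i"). The function name, the toy word list and the in-memory index are assumptions for illustration; a real system would loop over all confusions within the chosen edit distance and would still have to filter the retrieved pairs, since anagram values only compare bags of characters.

```python
from collections import defaultdict


def anagram_key(word, power=5):
    """Sum of character code points raised to `power`."""
    return sum(ord(ch) ** power for ch in word)


def confusion_pairs(words, removed, added):
    """Find (source, target) pairs whose bags of characters differ exactly by
    dropping the characters in `removed` and adding those in `added`."""
    index = defaultdict(list)
    for word in words:
        index[anagram_key(word)].append(word)
    difference = anagram_key(added) - anagram_key(removed)
    pairs = []
    for key, sources in index.items():
        # .get() does not create new keys, so iterating stays safe.
        for target in index.get(key + difference, []):
            for source in sources:
                pairs.append((source, target))
    return sorted(pairs)


# Toy word list for illustration only.
corpus_words = ["muyto", "muito", "ley", "lei", "rey", "rei", "dia"]
pairs = confusion_pairs(corpus_words, removed="y", added="i")
```

One pass over the index retrieves every pair in the word list that displays this particular confusion, here ("ley", "lei"), ("muyto", "muito") and ("rey", "rei").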
TICCL for Portuguese
TICCL has been converted to Portuguese by providing it with a Portuguese lexicon, which was the same one as used for VARD2.
Derived from the lexicon is a word confusion matrix which in fact provides the list of all possible confusables (also known as real-word errors in spelling correction).
TICCL has been equipped with absolute correction (cf. Pollock & Zamora 1984).
TICCL has been equipped with bigram correction capabilities: only applied to short words in this study.
Comparison of VARD2 and TICCL
Table: Best-first ranked performance of TICCL and VARD2 on the tokens of the test set. TICCL was trained only on the training set variant list. VARD2 and TICCL2 were trained on both the training set variant list and the HDBP variant list.

Tool      acc    prec   recall  f-score
VARD2    94.65  96.99   73.63    83.71
TICCL    93.25  94.27   67.96    78.98
TICCL2   93.50  94.38   69.33    79.94
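The f-score column is consistent with the balanced harmonic mean of precision and recall. This reading is inferred from the figures in the table, not stated on the slide. Worked for the VARD2 row:

```latex
F = \frac{2PR}{P + R},
\qquad
F_{\mathrm{VARD2}} = \frac{2 \times 96.99 \times 73.63}{96.99 + 73.63} \approx 83.71
```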
Comparison of VARD2 and TICCL - II
Table: Results on the tokens of the test set of 100 letters measuring TICCL’s 3, 5, 10 and 20 first-best ranking with bigram correction and with absolute correction. Also shown is the effect of TICCL not performing bigram correction. Finally, the table shows the effects of VARD2 not having been trained and of TICCL not using absolute correction with the variation list(s).

Tool                         acc    precision  recall  f-score
TICCL-bi-rank3              94.11     94.62    72.57    82.14
TICCL-bi-rank5              94.35     94.71    73.89    83.01
TICCL-bi-rank10             94.55     94.78    74.92    83.69
TICCL-bi-rank20             94.66     94.82    75.52    84.08
TICCL-uni-rank20            94.42     95.03    73.99    83.20
VARD2-notraining            90.58     93.79    53.05    67.77
TICCL-bi-rank20-noabsolut   89.18     92.03    46.02    61.35
Discussion of the evaluations
Observations
Both VARD2 and TICCL benefit greatly from absolute correction or specific training data
TICCL goes some way towards context-sensitive spelling correction, but lacked contemporary bigrams from a background corpus
TICCL has a ranking problem due to the greater morphological variation in Portuguese. Might outperform VARD2 if this were solved.
TICCL could also be extended to productively handle larger Levenshtein distances on the basis of gold standard training data, e.g. numerical anagram value for the difference between ‘exmo’ and ‘excelentíssimo’ also holds for the plural and feminine forms.
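The last observation can be made concrete. The sketch below learns one anagram-value difference from the pair ‘exmo’/‘excelentíssimo’ and reuses it for the other inflected forms; the key function with power 5 follows the earlier slides, while the toy lexicon and the helper `expand` are hypothetical.

```python
def anagram_key(word, power=5):
    """Sum of character code points raised to `power`."""
    return sum(ord(ch) ** power for ch in word)


# Difference learned from a single gold-standard pair.
expansion = anagram_key("excelentíssimo") - anagram_key("exmo")

# Toy lexicon of full forms, indexed by anagram value.
lexicon = ["excelentíssimo", "excelentíssima",
           "excelentíssimos", "excelentíssimas"]
index = {}
for word in lexicon:
    index.setdefault(anagram_key(word), []).append(word)


def expand(abbreviation):
    """Look up full forms whose value is the abbreviation's value plus
    the learned difference."""
    return index.get(anagram_key(abbreviation) + expansion, [])
```

The characters added when expanding the abbreviation are the same for every inflected form, so `expand("exma")`, `expand("exmos")` and `expand("exmas")` each retrieve the matching full form from one learned value.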
Future steps
What to do next?
Study the strengths of VARD and TICCL and see whether we can combine them in one system.
(Do the right thing and) Give TICCL contemporary bigrams from a background corpus.
Full context-sensitive spelling correction is needed for this type of spelling variation to raise recall above the ~75% ceiling reached now.
Thanks!!
Thanks for your attention!
Papers about TICCL are available at: http://ilk.uvt.nl/