HELSINGIN YLIOPISTO HELSINGFORS UNIVERSITET UNIVERSITY OF HELSINKI Building Real-World Finite-State Spell-Checkers With HFST in FSMNLP 2012 at Donostia Tommi A Pirinen ‹tommi.pirinen@helsinki.fi› August 2, 2012 University of Helsinki


Page 1: Building Real-World Finite-State Spell-Checkers With HFSTcomputing.dcu.ie/~tpirinen/fsmnlp-2012-spelling-tutorial.pdf · 2012-08-02 · HELSINGIN YLIOPISTO HELSINGFORS UNIVERSITET

HELSINGIN YLIOPISTO HELSINGFORS UNIVERSITET UNIVERSITY OF HELSINKI

Building Real-World Finite-State Spell-Checkers With HFST
in FSMNLP 2012 at Donostia

Tommi A Pirinen <tommi.pirinen@helsinki.fi>

August 2, 2012
University of Helsinki, Department of Modern Languages


Outline

Introduction

Language Models

Error Models

Misc. practicalities

Introduction


Finite-State Spell-Checking

A simple task: go through all the words in a text to see whether they belong to the language L; if not, modify them with the relation E so that they fit into L.
In the finite-state world, the language model L is any (weighted) single-tape finite-state automaton recognising the words of the language.
The error model E is any (weighted) two-tape automaton that describes spelling errors, i.e. a mapping from a misspelt word to the correct one.
In this tutorial we build a simple WFST language model and piece together an error model from different error types stored in FSTs.


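(Weighted) finite-state spell-checking graphically

From this misspelled word as automaton:

[Figure: linear automaton with states 0 to 3 and arcs c, t, a]

Applying this very specific error (correction) model:

[Figure: linear automaton with states 0 to 3 and arcs c, t:a, a:t]

We can compose or intersect with a word in this dictionary:

[Figure: dictionary automaton with states 0 to 8]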
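The idea on this slide can be imitated without any FST toolkit. The following is a minimal plain-shell sketch, not HFST: it assumes the misspelled input is `cta`, that the error model is a single swap of two adjacent characters, and that the dictionary is a three-word list invented for illustration. The function name `swap_candidates` and the file `demo-dict.txt` are hypothetical.

```shell
# Hypothetical toy dictionary standing in for the language model.
printf 'cat\ncats\nchase\n' > demo-dict.txt

# Toy error model: print every string obtained from $1 by swapping
# one pair of adjacent characters.
swap_candidates() {
    awk -v w="$1" 'BEGIN {
        for (i = 1; i < length(w); i++)
            print substr(w, 1, i - 1) substr(w, i + 1, 1) substr(w, i, 1) substr(w, i + 2)
    }'
}

# "Intersect" the candidates with the dictionary: keep exact whole-line matches.
swap_candidates cta | grep -Fx -f demo-dict.txt
```

Of the two candidates generated for `cta`, only `cat` is in the toy dictionary, so that is the only suggestion printed.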


With HFST Tools (but mostly generic FST algebra)

The tools we use to build the automata are from the HFST project, http://hfst.sf.net, but most of what I show here is finite-state algebra, which your favorite FST tools should support as well.
The chain of libraries that gets HFST automata spellers into any desktop application is HFST ospell → voikko → enchant (with a few exceptional applications not supporting this).
https://kitwiki.csc.fi/twiki/bin/view/KitWiki/HfstUseAsSpellChecker

Language Models


Easy baseline for language model: a word-list

The most basic language model for checking whether a word is correctly written is a word-form list: abacus, ..., zygotes, right?
A fast way to build such a word-list now: grab an online text collection and split it into words.
Doing it like this, we also get probabilities: $P(w) = \frac{c(w)}{CS}$, where $c(w)$ is the count of the word $w$ and $CS$ is the corpus size.
In the end we'd probably mangle the probabilities into the tropical logprobs that our WFSTs use by default, say: $-\log P(w)$
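As a worked instance of the formula, assume a hypothetical corpus of 20 tokens in which a word occurs 5 times, and assume the natural logarithm; the numbers are invented for illustration:

```latex
\[
P(w) = \frac{c(w)}{CS} = \frac{5}{20} = 0.25,
\qquad
-\log P(w) = -\ln 0.25 \approx 1.386
\]
```

Under the same assumptions a word seen only once gets $-\ln(1/20) \approx 2.996$, so rarer words carry larger weights.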


Compiling corpus into a spell-checker with HFST

A simple single command-line (no line breaks):
tr -d '[:punct:]' < demo-corpus.txt | tr -s '[:space:]' '\n' | sort | uniq -c | awk -f tropicalize-uniq-c.awk --assign=CS=20 | hfst-txt2fst -f openfst -o demo-lm.hfst

In order: clean punctuation, tokenize, sort, count, log probs and compile.
Let's try; you may grab a corpus from , or a Wikipedia dump, or write one on the fly if you want to test it now.
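The script tropicalize-uniq-c.awk is not shown on the slides. The following is only a guess at what such a script might do, written as a shell function so that it can be tried without HFST: it reads the `count word` lines printed by `uniq -c` and prints each word with the weight -log(count / CS), separated by a TAB. The function name `tropicalize`, the output layout and the inline four-token corpus are assumptions.

```shell
# Hypothetical stand-in for tropicalize-uniq-c.awk.
# $1 is the corpus size CS; stdin is the output of `uniq -c`.
tropicalize() {
    LC_ALL=C awk -v CS="$1" '{ printf "%s\t%.4f\n", $2, -log($1 / CS) }'
}

# A four-token corpus written on the fly, so CS is 4.
printf 'the cat, the dog.\n' |
    tr -d '[:punct:]' |
    tr -s '[:space:]' '\n' |
    sort |
    uniq -c |
    tropicalize 4
```

The more frequent word `the` comes out with a smaller weight than `cat` and `dog`, which is the ordering a tropical-weight language model needs.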


A language model as a FST

Resulting language model should look like this:

[Figure: the resulting language model drawn as an automaton with states 0 to 55]


Other language models

A word-form list is only good for very basic demoing, and mostly for isolating languages.
Since any given finite-state dictionary is usable, we can use, and have used, e.g.:

Xerox style morphologies (lexc, twolc, xfst) compiled with foma or HFST tools
apertium dictionaries
hunspell dictionaries, with a very kludgey collection of conversion scripts

From a morphological analyser you may want to project off the analysis tape; we currently won't use it.
Weighting language models afterwards from corpora is simple, cf. for example my past publications for HFST scripts for that.

Error Models


Building error models from scratch

Error models are two-tape automata that map misspelled word-forms into correctly spelled ones.
Basic error types / corrections covered here: typing errors (Levenshtein-Damerau), phonemic/orthographic competence errors, common confusable word-forms.
Other relatively easy corrections not covered here: phonemic folding, errors in context...


Building error models from scratch

Since the types of errors that can appear are very different, we now build separate automata components for each, and then connect them together.
Here is a good time to look back at the dictionary to see what the weights are and scale the error weights accordingly using the Stetson-Harrison algorithm.


Basic component 1: Run of correctly spelled characters

We assume most misspelled words will have most of the letters correct; to account for these parts we need the identity automaton:
echo '?*' | hfst-regexp2fst -o anystar.hfst

[Figure: single-state automaton, state 0 with a loop labelled ?]


Basic component 2: Typing error

A single typing error, as in the popular Levenshtein measure, is a) removal, b) addition or c) change of a character. Using a fixed alphabet, for example one with a and b, and let's say the error is equally unlikely for all combinations (weight = 10):
echo 'a:0::10 | 0:a::10 | a:b::10 | b:0::10 | 0:b::10 | b:a::10' | hfst-regexp2fst -o edit1.hfst

[Figure: two-state automaton, states 0 and 1, with weight-10 arcs for each deletion, insertion and substitution over a, b and the unknown symbol ?]
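Writing the expression by hand stops being practical once the alphabet has more than a couple of symbols. The following is a minimal sketch of a generator, assuming the same regular expression notation as the command above; the function name `edit1_regex` is hypothetical, and symbols that are special in the regular expression syntax (such as `0` or `?`) would need escaping that this sketch does not attempt.

```shell
# Print a regular expression for one edit over the given alphabet.
# $1 is the weight, the remaining arguments are the symbols.
edit1_regex() {
    weight=$1
    shift
    for x in "$@"; do
        printf '%s:0::%s\n' "$x" "$weight"      # x paired with the empty string
        printf '0:%s::%s\n' "$x" "$weight"      # the empty string paired with x
        for y in "$@"; do
            # x paired with a different symbol y
            [ "$x" = "$y" ] || printf '%s:%s::%s\n' "$x" "$y" "$weight"
        done
    done | awk 'NR > 1 { printf " | " } { printf "%s", $0 } END { print "" }'
}

# Reproduces the expression for the alphabet a, b with weight 10.
edit1_regex 10 a b
```

The printed line can be piped to hfst-regexp2fst -o edit1.hfst exactly as in the hand-written command.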


Excursion 1: Swap in Levenshtein-Damerau measure

The fourth type of error in the edit distance algorithm, transposition or swapping of adjacent characters, is useful, but makes the automaton a bit messy and tedious to write (which is why we have a few scripts to generate them). This is the additional component that must be added for each pair of characters in the alphabet:
echo 'a:b::10 (b:a) | b:a::10 (a:b)' | hfst-regexp2fst
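The per-pair component can be generated mechanically. This is a small sketch under the same notation as the command above, not one of the author's generator scripts; the function name `swap_regex` is hypothetical and no escaping of special symbols is attempted.

```shell
# Print the swap component for every ordered pair of distinct symbols.
# $1 is the weight, the remaining arguments are the symbols.
swap_regex() {
    weight=$1
    shift
    for x in "$@"; do
        for y in "$@"; do
            [ "$x" = "$y" ] || printf '%s:%s::%s (%s:%s)\n' "$x" "$y" "$weight" "$y" "$x"
        done
    done | awk 'NR > 1 { printf " | " } { printf "%s", $0 } END { print "" }'
}

# Reproduces the two-symbol expression with weight 10.
swap_regex 10 a b
```

For an alphabet of n symbols this prints n(n-1) alternatives, which is why the expression grows quickly.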


Edit distance automata

Combining the correctly written parts and a single edit gets us a nice baseline error model:
hfst-concatenate anystar.hfst edit1.hfst | hfst-concatenate - anystar.hfst -o errm1.hfst

Since it's just an automaton, all the regular tricks of FST algebra just work, particularly:
hfst-repeat -f 1 -t 7 -i errm1.hfst -o errm7.hfst

(A script to automate writing of the combinations of swaps: python editdist.py -d 1 -S -a demo.hfst | hfst-txt2fst -o edit1.hfst)


Edit distance automaton graphically

[Figure: the edit distance automaton with states 0, 1 and 2; arcs cover identities, insertions, deletions and substitutions over a, b and the unknown symbol]


Common orthography-based competence errors

English orthography sometimes leads to errors that are not easily solvable by basic edit distance, so we can extend the edit distance errors with a set of commonly confused substrings (N.B. the input to strings2fst expects a TAB between the strings and the weight; the weight is 5, i.e. a bit more likely than before):
hfst-strings2fst -j -o en-orth.hfst
of:ough	5
ier:eir	5

[Figure: automaton for en-orth.hfst with states 0 to 7, one path mapping of to ough and one mapping ier to eir]
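Because the TAB is easy to lose when typing or pasting, one option is to write the input into a file with printf first. This is a minimal sketch; the file name `en-orth.txt` is an assumption, and the compile step is guarded so that the snippet also runs where HFST is not installed.

```shell
# printf turns \t into a real TAB between the string pair and the weight.
printf 'of:ough\t5\nier:eir\t5\n' > en-orth.txt

# Sanity check: count the lines that have exactly two TAB-separated fields.
awk -F '\t' 'NF == 2 { n++ } END { print n + 0 }' en-orth.txt

# Compile only if HFST is available; same options as in the command above.
if command -v hfst-strings2fst > /dev/null 2>&1; then
    hfst-strings2fst -j -o en-orth.hfst < en-orth.txt
fi
```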


Edit Distance Plus Orthographic Errors

Combine the orthographic errors with the edit1 errors before making the error model that covers full words:
hfst-disjunct edit1.hfst en-orth.hfst -o edit1+en-orth.hfst

And the rest as before:
hfst-concatenate anystar.hfst edit1+en-orth.hfst | hfst-concatenate - anystar.hfst -o errm.edit1+en-orth.hfst


Confusion set of words

A very simple, yet very important piece of the error model: common word-specific mistakes, of course compiled from a list of mistakes:
hfst-strings2fst -j -o en-confusables.hfst
teh:the	2.5
mispelled:misspelt	2.5
conscienchous:conscientious	2.5

[Figure: automaton for en-confusables.hfst with states 0 to 25, one path per confusable word pair]
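If the list of mistakes is kept as plain pairs, the input lines for the command above can be produced mechanically. This is a sketch under assumptions: the file `mistakes.txt`, its space-separated "misspelling correction" layout and the function name `confusables` are all hypothetical.

```shell
# Hypothetical list of mistakes: misspelling, then correction.
cat > mistakes.txt <<'EOF'
teh the
mispelled misspelt
conscienchous conscientious
EOF

# Turn each pair on stdin into a "wrong:right<TAB>weight" line.
# $1 is the weight given to every pair.
confusables() {
    awk -v w="$1" '{ printf "%s:%s\t%s\n", $1, $2, w }'
}

confusables 2.5 < mistakes.txt
```

The output has the same shape as the three lines fed to hfst-strings2fst -j on this slide.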


Finally: hacking all these together

Adding one more component to the mixture: the full-word errors can just be disjuncted to any existing full error model:
hfst-disjunct en-confusables.hfst errm.edit1+en-orth.hfst -o errm.everything.hfst

Should your language happen to use productive compounding, you may wish to hfst-repeat -f 1 the full-word errors before disjuncting.

(The resulting automaton is too much for graphviz already)

Misc. practicalities


Getting it used in real-world applications

With the help of a few libraries it's relatively easy to get these automata into any typical desktop application.
libvoikko was originally for Finnish spell-checking based on malaga (lag) grammars, and it still requires that format of metadata to be present. The installation location is also inherited from there.
The HFST ospell file format was specced openly on a mailing list with a few FST people; the result: XML-based metadata in a zip archive with very specific file naming conventions.


Metadata

Two parts of metadata:
voikko-fi_FI.pro file saying the language code and a few other details in a MIME-headerish format (line-based, key: value)
XML metadata about the automata: how to lay out dictionaries and error models, plus a lot of UI metadata like names and descriptions
The most meaningful setting in these files is the language code; it must match the locale settings and the name of the locale in program settings where applicable.


voikko-fi_FI.pro

info: Voikko-Dictionary-Format: 2
info: Language-Code: fi_FI
info: Language-Variant: hfst
info: Description: Finnish HFST dictionary
info: Morphology-Backend: hfst
info: Speller-Backend: hfst
info: Suggestion-Backend: hfst
info: Hyphenator-Backend: hfst
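Since the language code is the setting that has to match, it can help to generate the file from a single parameter. The following is a sketch that uses only the keys shown on this slide; the function name `voikko_pro` is hypothetical, and whether the same set of keys is sufficient for other languages is an assumption.

```shell
# Print the body of a voikko .pro file.
# $1 is the language code, $2 the free-text description.
voikko_pro() {
    cat <<EOF
info: Voikko-Dictionary-Format: 2
info: Language-Code: $1
info: Language-Variant: hfst
info: Description: $2
info: Morphology-Backend: hfst
info: Speller-Backend: hfst
info: Suggestion-Backend: hfst
info: Hyphenator-Backend: hfst
EOF
}

voikko_pro fi_FI 'Finnish HFST dictionary' > voikko-fi_FI.pro

# Show the line that must agree with the locale settings.
grep 'Language-Code' voikko-fi_FI.pro
```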


<hfstspeller dtdversion="1.0" hfstversion="3">
  <info>
    <locale>fi</locale>
    ...
  </info>
  <acceptor type="general" id="acceptor.default.hfst">
    ...
  </acceptor>
  <errmodel>
    ...
  </errmodel>
</hfstspeller>


Zipping and Installation Location

The dictionary or language model needs to be stored in acceptor.default.hfst (configurable) in a special format:
hfst-fst2fst -f olw -i demo-lm.hfst -o acceptor.default.hfst
The error models are errmodel.default.hfst and likewise in the special format:
hfst-fst2fst -f olw -i errm.everything.hfst -o errmodel.default.hfst
The XML metadata goes into index.xml; then zip it all into speller.zhfst like so:
zip -9 speller.zhfst acceptor.default.hfst errmodel.default.hfst index.xml
Now copy this and the voikko-fi_FI.pro into $HOME/.voikko/2/mor-hfst-ll or $prefix/share/voikko/2/mor-hfst-ll and the language should just work in voikko-enabled software:
mkdir -p $HOME/.voikko/2/mor-hfst-ll
cp speller.zhfst voikko-fi_FI.pro $HOME/.voikko/2/mor-hfst-ll