helyes írás ellenőrzés

57
Helyesírás ellenőrzés

Upload: kamea

Post on 24-Feb-2016

90 views

Category:

Documents


3 download

DESCRIPTION

Helyes írás ellenőrzés. Kiselőadások elvárt formátuma. Minden előadásról kell készüljön egy 5-7 slideos prezentáció (pl. ppt / pdf ) A prezentációt le kell adni legkésőbb az előadás első perceiben - PowerPoint PPT Presentation

TRANSCRIPT

Page 1: Helyes írás ellenőrzés

Helyesírás ellenőrzés

Page 2: Helyes írás ellenőrzés

Kiselőadások elvárt formátuma• Minden előadásról kell készüljön egy 5-7 slideos

prezentáció (pl. ppt/pdf)– A prezentációt le kell adni legkésőbb az előadás első

perceiben• A slideok a teljes cikket át kell hogy tekintsék, a

problémafelvetésétől kezdve a javasolt megoldásokon át

• Fontos hogy a rendelkezésre álló 5 percet nem szabad túllépni az előadás hosszával– 5 percnél tovább senki nem fog tudni prezentálni, ha a

prezentáció ezalatt az idő alatt nem érthető tartalmat mutat be akkor a feladatot nem teljesítettnek veszem!

Page 3: Helyes írás ellenőrzés

3

Where we are going

• We’ve looked at some more abstract ideas of language processing, and now we’re going to focus on a real-world application, detecting and correcting errors in spelling in a document

• The general techniques (e.g., probabilistic methods) will be applicable in later units, e.g., POS-tagging

• Running through such a practical application will make you more aware of the kinds of things you need to know in order to combat a problem

• The more you know about the kinds of errors people make, the more likely you can find them

Page 4: Helyes írás ellenőrzés

4

Who cares about spelling?• Aoccdrnig to a rscheearch at Cmabrigde Uinervtisy, it deosn’t

mttaer in waht oredr the ltteers in a wrod are, the olny iprmoetnt tihng is taht the frist and lsat ltteer be at the rghit pclae. The rset can be a toatl mses and you can sitll raed it wouthit porbelm. Tihs is bcuseae the huamn mnid deos not raed ervey lteter by istlef, but the wrod as a wlohe.

• (See http://www.mrc-cbu.cam.ac.uk/personal/matt.davis/Cmabrigde/ for the story behind this supposed research report.)

Page 5: Helyes írás ellenőrzés

5

Detection vs. Correction

• There are two distinct tasks:– error detection = simply find the misspelled words– error correction = correct the misspelled words

• e.g., It might be easy to tell that ater is a misspelled word, but what is the correct word? water? later? after?

• So, what causes errors?

Page 6: Helyes írás ellenőrzés

6

Spelling Tasks

• Spelling Error Detection• Spelling Error Correction:– Autocorrect • htethe

– Suggest a correction– Suggestion lists

Page 7: Helyes írás ellenőrzés

7

Types of spelling errors

• Non-word Errors– graffe giraffe

• Real-word Errors– Typographical errors• three there

– Cognitive Errors (homophones)• piecepeace, • too two

Page 8: Helyes írás ellenőrzés

8

Non-word spelling errors

• Non-word spelling error detection:– Any word not in a dictionary is an error– The larger the dictionary the better

• Non-word spelling error correction:– Generate candidates: real words that are similar

to error– Choose the one which is best:• Shortest weighted edit distance• Highest noisy channel probability

Page 9: Helyes írás ellenőrzés

9

Real word spelling errors

• For each word w, generate candidate set:– Find candidate words with similar pronunciations– Find candidate words with similar spelling– Include w in candidate set

• Choose best candidate– Noisy Channel

Page 10: Helyes írás ellenőrzés

10

Keyboard mistyping• Space bar issues

– run-on errors = two separate words become one • e.g., the fuzz becomes thefuzz

– split errors = one word becomes two separate words• e.g., equalization becomes equali zation

• Keyboard proximity– e.g., Jack becomes Hack since h, j are next to each other on a typical

American keyboard• Physical similarity

– similarity of shape, e.g., mistaking two physically similar letters when typing up something handwritten • e.g., tight for fight

Page 11: Helyes írás ellenőrzés

11

Phonetic errors• phonetic errors = errors based on the sounds of a language

(not necessarily on the letters)– homophones = two words which sound the same

• e.g., red/ read (past tense), cite/ site/ sight, they’re/ their/ there– Spoonerisms = switching two letters/sounds around – letter substitution = replacing a letter (or sequence of letters) with a

similar-sounding one• e.g., John kracked his nuckles. instead of John cracked his knuckles• e.g., I study sikologee.

Page 12: Helyes írás ellenőrzés

12

Knowledge problems

• And then there are simply cases of not knowing how to spell:– not knowing a word and guessing its spelling (can

be phonetic) • e.g., sientist

– not knowing a rule and guessing it• e.g., Do we double a consonant for ing words? Jog ->

joging ?

Page 13: Helyes írás ellenőrzés

13

What makes spelling correction difficult?

• Tokenization: What is a word?– Definition is difficult with contractions, multi-token words, hyphens,

abbreviations• Inflection: How are some words related?

– How do we store rules and exceptions?• Productivity of language: How many words are there?

– Words entering and exiting the lexicon• How we handle these issues determines how we build a

dictionary.

Page 14: Helyes írás ellenőrzés

14

Techniques used for spell checking• Non-word error detection• Isolated-word error correction• Context-dependent word error detection and

correction– grammar correction.

• The exact techniques used will differ depending on if we are looking for spelling errors in human typing or with optical character recognition (OCR)

Page 15: Helyes írás ellenőrzés

15

Non-word error detection

• non-word error detection is essentially the same thing as word recognition = splitting up “words” into true words and non-words.

• How is non-word error detection done?– Using a dictionary: Most common way to find non-

word errors– N-gram analysis: • fast and simple technique, but most typing errors are

still valid n-grams• used with OCR more than typing

Page 16: Helyes írás ellenőrzés

16

Dictionaries

• Intuition:– Have a complete list of words and check the input words

against this list.– If it’s not in the dictionary, it’s not a word.

• Two aspects:– Dictionary construction = build the dictionary (what do

you put in it?)– Dictionary lookup = lookup a potential word in the

dictionary (how do you do this quickly?)

Page 17: Helyes írás ellenőrzés

17

Dictionary construction• Do we include inflected words? i.e., words with prefixes

and suffixes already attached– Pro: lookup can be faster– Con: takes much more space, doesn’t account for new

formations• Want the dictionary to have only the word relevant for

the user -> domain-specificity• Foreign words, hyphenations, derived words, proper

nouns, and new words will always be problems for dictionaries since we cannot predict these words until humans have made them words.

• Dictionary should probably be dialectally consistent.– e.g., include only color or colour but not both

Page 18: Helyes írás ellenőrzés

18

Dictionary lookup

• Several issues arise when trying to look up a word:– Have to make lookup fast by using efficient lookup

techniques, such as a hash table – Have to strip off prefixes and suffixes if the word

isn’t an entry by itself.

Page 19: Helyes írás ellenőrzés

19

Isolated-word error correction• Having discussed how errors can be detected, we

want to know how to correct these misspelled words:– The most common method is isolated-word error

correction = correcting words without taking context into account.

– Note: This technique can only handle errors that result in non-words.

• Knowledge about what is a typical error helps in finding correct word.

Page 20: Helyes írás ellenőrzés

20

Knowledge about typical errors

• Word length effects: most misspellings are within two characters in length of original– When searching for the correct spelling, we do not

usually need to look at words with greater length differences

• First-position error effects: the first letter of a word is rarely erroneous– When searching for the correct spelling, the process is

sped up by being able to look only at words with the same first letter.

Page 21: Helyes írás ellenőrzés

21

Isolated-word error correction methods• Many different methods are used; we will briefly look at

four methods:– rule-based methods– similarity key techniques– minimum edit distance– probabilistic methods

• The methods play a role in one of the three basic steps:– 1. Detection of an error (discussed above)– 2. Generation of candidate corrections

• rule-based methods• similarity key techniques

– 3. Ranking of candidate corrections• probabilistic methods• minimum edit distance

Page 22: Helyes írás ellenőrzés

22

Rule-based methods• One can generate correct spellings by writing

rules:–Common misspelling rewritten as correct word:• e.g., hte -> the

–Rules• based on inflections:– e.g., V+C+ing -> V+CC+ing

(where V = vowel and C = consonant)• based on other common spelling errors (such as

keyboard effects or common transpositions):– e.g., Cie -> Cei

Page 23: Helyes írás ellenőrzés

23

Similarity key techniques• Problem: How can we find a list of possible corrections?• Solution: Store words in different boxes in a way that puts

the similar words together.• Example:– 1. Start by storing words by their first letter (first letter effect),

• e.g., punc starts with the code P.– 2. Then assign numbers to each letter

• e.g., 0 for vowels, 1 for b, p, f, v(all bilabials), and so forth, • e.g., punc -> P052

– 3. Then throw out all zeros and repeated letters,• e.g., P052 -> P52.

– 4. Look for real words within the same box, • e.g., punk is also in the P52 box.

Page 24: Helyes írás ellenőrzés

24

How is a mistyped word related to the intended?

• Types of errors– insertion = a letter is added to a word– deletion = a letter is deleted from a word– substitution = a letter is put in place of another one– transposition = two adjacent letters are switched

• Note that the first two alter the length of the word, whereas the second two maintain the same length.

• General properties– single-error misspellings = only one instance of an error– multi-error misspellings = multiple instances of errors (harder

to identify)

Page 25: Helyes írás ellenőrzés

25

Probabilistic methods• Two main probabilities are taken into account:– transition probabilities = probability (chance) of going

from one letter to the next.• . e.g., What is the chance that a will follow p in English? That u will

follow q?– confusion probabilities = probability of one letter being

mistaken (substituted) for another (can be derived from a confusion matrix)• e.g., What is the chance that q is confused with p?

• Useful to combine probabilistic techniques with dictionary methods

Page 26: Helyes írás ellenőrzés

26

Confusion probabilities• For the various reasons discussed above (keyboard layout,

phonetic similarity, etc.) people type other letters than the ones they intended.

• It is impossible to fully investigate all possible error causes and how they interact, but we can learn from watching how often people make errors and where.

• One way of doing so is to build a confusion matrix = a table indicating how often one letter is mistyped for another (this is a substitution matrix)

Page 27: Helyes írás ellenőrzés

27

The Noisy Channel Model• We can view the setup like this:– SOURCE: word -> NOISY CHANNEL -> noisy word– We need to decode the noisy word to figure out what the

original was• The noisy channel model has been very popular in

speech recognition, among other fields– Noisy word: O = observation (incorrect spelling)– To guess at the original word, we want to find the word (w)

which maximizes: P(w|O), i.e., the probability of w, given that O has been seen

Page 28: Helyes írás ellenőrzés

28

Non-word spelling error example

acress

Page 29: Helyes írás ellenőrzés

29

Candidate generation

• Words with similar spelling– Small edit distance to error

• Words with similar pronunciation– Small edit distance of pronunciation to error

Page 30: Helyes írás ellenőrzés

30

Damerau-Levenshtein edit distance

• Minimal edit distance between two strings, where edits are:– Insertion– Deletion– Substitution– Transposition of two adjacent letters

Page 31: Helyes írás ellenőrzés

31

Candidate generation

• 80% of errors are within edit distance 1• Almost all errors within edit distance 2

• Also allow insertion of space or hyphen– thisidea this idea– inlaw in-law

Page 32: Helyes írás ellenőrzés

32

Words within 1 of acressError Candidate

CorrectionCorrect Letter

Error Letter

Type

acress actress t - deletion

acress cress - a insertion

acress caress ca ac transposition

acress access c r substitution

acress across o e substitution

acress acres - s insertion

acress acres - s insertion

Page 33: Helyes írás ellenőrzés

33

Wait, how do you generate the candidates?

1. Run through dictionary, check edit distance with each word2. Generate all words within edit distance ≤ k (e.g., k = 1 or 2)

and then intersect them with dictionary3. Use a character k-gram index and find dictionary words

that share “most” k-grams with word (e.g., by Jaccard coefficient)

4. Compute them fast with a Levenshtein finite state transducer

5. Have a precomputed hash of words to possible corrections

Page 34: Helyes írás ellenőrzés

34

Computing error probability: confusion matrix

del[x,y]: count(xy typed as x)ins[x,y]: count(x typed as xy)sub[x,y]: count(x typed as y)trans[x,y]: count(xy typed as yx)

Insertion and deletion conditioned on previous character

Page 35: Helyes írás ellenőrzés

35

Channel model Kernighan, Church, Gale 1990

Page 36: Helyes írás ellenőrzés

36

Smoothing probabilities: Add-1 smoothing

• But if we use the last slide, unseen errors are impossible!

• They’ll make the overall probability 0. That seems too harsh– e.g., in Kernighan’s chart qa and aq are both 0,

even though they’re adjacent on the keyboard!• A simple solution is to add one to all counts and

then if there is a |A| character alphabet, to normalize appropriately:

Page 37: Helyes írás ellenőrzés

37

Noisy channel probability for acress

Candidate Correction

Correct Letter

Error Letter

x|w P(x|word) P(word) 109 *P(x|w)P(w)

actress t - c|ct .000117 .0000231 2.7

cress - a a|# .00000144 .000000544 .00078

caress ca ac ac|ca .00000164 .00000170 .0028

access c r r|c .000000209 .0000916 .019

across o e e|o .0000093 .000299 2.8

acres - s es|e .0000321 .0000318 1.0

acres - s ss|s .0000342 .0000318 1.0

Page 38: Helyes írás ellenőrzés

38

Using a bigram language model

• “a stellar and versatile acress whose combination of sass and glamour…”

• Counts from the Corpus of Contemporary American English with add-1 smoothing

• P(actress|versatile)=.000021 P(whose|actress) = .0010• P(across|versatile) =.000021 P(whose|across) = .000006

• P(“versatile actress whose”) = .000021*.0010 = 210 x10-10

• P(“versatile across whose”) = .000021*.000006 = 1 x10-10

Page 39: Helyes írás ellenőrzés

39

Real-word spelling errors

• …leaving in about fifteen minuets to go to her house.• The design an construction of the system…• Can they lave him my messages?• The study was conducted mainly be John Black.

• 25-40% of spelling errors are real words Kukich 1992

Page 40: Helyes írás ellenőrzés

40

Noisy channel for real-word spell correction

Page 41: Helyes írás ellenőrzés

Categorization Problem : Spelling

1. Field2. Wield3. Shield4. Deceive5. Receive6. Ceiling

Page 42: Helyes írás ellenőrzés

Rule-based Approach

“I before E except after C”

-- an example of a linguistic insight

Page 43: Helyes írás ellenőrzés

Probabilistic Statistical Model:

• Count the occurrences of ‘ie’ and ‘ei’ and ‘cie’ and ‘cei’ in a large corpus

P(IE) = 0.0177P(EI) = 0.0046P(CIE) = 0.0014P(CEI) = 0.0005

Page 44: Helyes írás ellenőrzés

Words where ie occur after c

• science• society• ancient• species

Page 45: Helyes írás ellenőrzés

Chomsky’s Argument

• A completely new sentence must have a probability of 0, since it is an outcome that has not been seen.

• Since novel sentences are in fact generated all the time, there is a contradiction.

Page 46: Helyes írás ellenőrzés

• Probabilities should broadly indicate likelihood of sentences – P(I saw a van) >> P(eyes awe of an)

Probabilistic language model

Courtesy Dan Klein

Page 47: Helyes írás ellenőrzés

Spell Checking

Let’s say the word hand is mistyped

Hand --- *Hamd

There you have an unknown word!

Page 48: Helyes írás ellenőrzés

Spell Checking

Out-of-Vocabulary Error

*Hamd

Page 49: Helyes írás ellenőrzés

Spell Checking

*Hamd

Hand

Page 50: Helyes írás ellenőrzés

Spell Checking

*Hamd

Hand Hard

Page 51: Helyes írás ellenőrzés

What we need to find …

• P(hard|hamd)• P(hand|hamd)

• Whichever is greater is the right one!

Page 52: Helyes írás ellenőrzés

What we need to find …

• But how do you find the probability P(hard|hamd) when ‘hamd’ is an *unknown word*?

Page 53: Helyes írás ellenőrzés

• P(E|F) = P(F|E) * P(E) / P(F)

• P(hand|hamd) = P(hamd|hand)*P(hand)

P(hamd|hand) = P(h|h)P(a|a)P(m|n)P(d|d)

• P(hand|hamd) = P(m|n)*P(hand)

BAYES RULE

The stuff in green is approximately 1, so ignore it!

Page 54: Helyes írás ellenőrzés

So we have what we needed to find …

• P(hard|hamd) = P(m|n)*P(hand)• P(hand|hamd) = P(m|r)*P(hard)

• Whichever is greater is the right one!

Page 55: Helyes írás ellenőrzés

Note: Use of Unigram Probabilities

• The P(m|n) part is called the channel probability

• The P(hand) part is called the prior probability• Kernighan’s experiment showed that P(hand)

is very important (causes a 7% improvement)!

Page 56: Helyes írás ellenőrzés

Spell Checking

• How do you tell whether an unknown word is a new word or a spelling mistake?

• How can you tell “Shivaswamy” is not a wrong spelling of “Shuddering”

• You can use statistics

Page 57: Helyes írás ellenőrzés

T-test

• Calculate the probability of “Shivaswamy” having come about by errors made on “Shuddering” … that is, 8 errors happening in 10 letters.

• P(error) = 0.8• Say the average letter error rate is 0.1.• Verify using the t-test that Shivaswamy is not

likely to have come about from Shuddering by a 0.1 error-per-letter process.

• Use the word length (10) as n.