LING/C SC/PSYC 438/538 Lecture 24 Sandiway Fong 1

Upload: jeffry-clement-gallagher

Post on 20-Jan-2018

Administrivia

• Homeworks 7 and 8 graded

Last Time

1. Homework 8 review
2. CKY algorithm for parsing context-free grammars
– Chomsky Normal Form (CNF)
– 2D table

      w1          w2          w3
[0,1] x1    [0,2] y     [0,3] s s
            [1,2] x2    [1,3] z
                        [2,3] x3

s --> y, x3.
s --> x1, z.
s --> y, z.
y --> x1, x2.
z --> x2, x3.
x1 --> [w1].
x2 --> [w2].
x3 --> [w3].
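
The table for this grammar can be reproduced with a small CKY recognizer. The following is a minimal Python sketch, not the course's own code: the names `BINARY_RULES`, `LEXICAL_RULES` and `cky` are made up for illustration, and only the grammar itself comes from the slide.

```python
from collections import defaultdict

# Grammar from the slide, already in Chomsky Normal Form.
BINARY_RULES = [
    ("s", "y", "x3"),
    ("s", "x1", "z"),
    ("s", "y", "z"),
    ("y", "x1", "x2"),
    ("z", "x2", "x3"),
]
LEXICAL_RULES = {"w1": ["x1"], "w2": ["x2"], "w3": ["x3"]}


def cky(words):
    """Fill the 2D table: table[i, j] lists the nonterminals spanning words[i:j]."""
    n = len(words)
    table = defaultdict(list)
    for i, word in enumerate(words):
        table[i, i + 1] = list(LEXICAL_RULES.get(word, []))
    for width in range(2, n + 1):           # span length
        for i in range(n - width + 1):      # span start
            j = i + width                   # span end
            for k in range(i + 1, j):       # split point
                for lhs, left, right in BINARY_RULES:
                    if left in table[i, k] and right in table[k, j]:
                        table[i, j].append(lhs)
    return table


table = cky(["w1", "w2", "w3"])
print(table[0, 3])  # ['s', 's']: one s per way of building the whole string
```

Cell [0,3] holds s twice because the string can be split as x1 + z or as y + x3; the rule s --> y, z never fires since y and z overlap on w2.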

Administrivia

• TCEs are fully online (paper TCEs are history)

https://tce.oirps.arizona.edu/TCEOnline

Spelling Errors

• Sections 3.10 and 5.9

Spelling Errors

• Textbook cites (Kukich, 1992):
– Non-word detection (easiest)
  • graffe (giraffe)
– Isolated-word (context-free) error correction
  • graffe (giraffe, …)
  • graffed (gaffed, …)
  • by definition cannot correct when the error word is a valid word
– Context-dependent error detection and correction (hardest)
  • your an idiot → you’re an idiot
  • Their is → there is
  • (Microsoft Word corrects this by default)

Spelling Errors

• OCR
– visual similarity
  • h→b, e→c, jump→jurnps
• Typing
– keyboard distance
  • small→smsll, spell→spel;
• Graffiti (many HCI studies)
– stroke similarity
  • Common error characters are: V, T, 4, L, E, Q, K, N, Y, 9, P, G, X
  • Two stroke characters: B, D, P (error: two characters)
• Cognitive Errors
– bad spellers
  • separate→seperate

correct

– textbook section 5.9
• Kernighan et al. (correct)
– take typo t (not a word)
  • mutate t minimally by deleting, inserting, substituting or transposing (swapping) a letter
  • look up “mutated t” in a dictionary
  • candidates are “mutated t” that are real words
– example (5.2)
  • t = acress
  • C = {actress, cress, caress, access, across, acres, acres}
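
The candidate-generation step can be sketched as follows. This is an illustrative Python sketch, not the original correct program: `edits1`, `candidates` and `TOY_DICTIONARY` are hypothetical names, and the dictionary is a tiny made-up word list rather than a real lexicon.

```python
import string


def edits1(t):
    """All strings one deletion, transposition, substitution or insertion away from t."""
    letters = string.ascii_lowercase
    splits = [(t[:i], t[i:]) for i in range(len(t) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    substitutes = [a + c + b[1:] for a, b in splits if b for c in letters]
    inserts = [a + c + b for a, b in splits for c in letters]
    return set(deletes + transposes + substitutes + inserts)


# Hypothetical toy dictionary; a real system would load a full word list.
TOY_DICTIONARY = {"actress", "cress", "caress", "access", "across",
                  "acres", "stress", "actor"}


def candidates(t, dictionary):
    """Mutations of t that are real words (excluding t itself)."""
    return (edits1(t) & dictionary) - {t}


print(sorted(candidates("acress", TOY_DICTIONARY)))
# ['access', 'acres', 'across', 'actress', 'caress', 'cress']
```

Because the result is a set, acres appears once here; the slide lists it twice because it can be reached from acress by deleting either of the two final s letters.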

correct

• formula
– ĉ = argmax_{c ∈ C} P(t|c) P(c) (Bayesian Inference)
– C = {actress, cress, caress, access, across, acres, acres}
• Prior: P(c)
– estimated using frequency information over a large corpus (N words)
– P(c) = freq(c)/N
– P(c) = (freq(c)+0.5)/(N+0.5V)
  • avoid zero counts (non-occurrences)
  • (add fractional part 0.5)
  • add one (0.5) smoothing
  • V is vocabulary size of corpus
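
As a small illustration of the smoothed prior, here is a Python sketch. The counts are hypothetical toy numbers chosen only to show the arithmetic, and `smoothed_prior` is a name introduced here.

```python
def smoothed_prior(word, freq, n_words, vocab_size):
    """P(c) = (freq(c) + 0.5) / (N + 0.5 V), so unseen words get a non-zero prior."""
    return (freq.get(word, 0) + 0.5) / (n_words + 0.5 * vocab_size)


# Hypothetical toy corpus counts; "cress" never occurs.
toy_freq = {"actress": 3, "acres": 6, "across": 11}
vocab = ["actress", "cress", "acres", "across"]
n_words = sum(toy_freq.values())  # N = 20

priors = {w: smoothed_prior(w, toy_freq, n_words, len(vocab)) for w in vocab}
for word, p in priors.items():
    print(f"{word:8s} {p:.4f}")
```

With these numbers the unseen word cress gets 0.5/22 instead of 0, and the smoothed priors over the whole vocabulary still sum to 1.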

correct

• Likelihood: P(t|c), the probability of typo t given candidate word c
– using some corpus of errors
– compute following 4 confusion matrices (each a 26 x 26 matrix, a–z by a–z)
– del[x,y] = freq(correct xy mistyped as x)
– ins[x,y] = freq(correct x mistyped as xy)
– sub[x,y] = freq(correct x mistyped as y)
– trans[x,y] = freq(correct xy mistyped as yx)
– P(t|c) = del[x,y]/f(xy) if c related to t by deletion of y
– P(t|c) = ins[x,y]/f(x) if c related to t by insertion of y
– etc…
• Very hard to collect this data
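
The likelihood lookup can be sketched in Python as below. The function name `channel_probability` and all counts are hypothetical. The slide gives the denominators only for deletion and insertion; dividing sub by f(x) and trans by f(xy) is an assumption that follows the same pattern.

```python
def channel_probability(edit, x, y, confusion, char_count, pair_count):
    """P(t|c) for a typo t that differs from candidate c by one edit on letters x, y."""
    if edit in ("del", "trans"):
        denominator = pair_count[x + y]   # f(xy): how often the correct pair xy occurs
    elif edit in ("ins", "sub"):
        denominator = char_count[x]       # f(x): how often the correct letter x occurs
    else:
        raise ValueError(f"unknown edit type: {edit}")
    return confusion[edit].get((x, y), 0) / denominator


# Hypothetical toy counts, not taken from any real error corpus.
confusion = {
    "del": {("c", "t"): 4},    # correct "ct" mistyped as "c"
    "ins": {("s", "s"): 6},    # correct "s" mistyped as "ss"
    "sub": {},
    "trans": {},
}
char_count = {"s": 3000}
pair_count = {"ct": 2000}

# t = acress, c = actress: the t after c was deleted.
print(channel_probability("del", "c", "t", confusion, char_count, pair_count))  # 0.002
# t = acress, c = acres: an extra s was inserted after s.
print(channel_probability("ins", "s", "s", confusion, char_count, pair_count))  # 0.002
```

Multiplying each value by the prior P(c) and taking the largest product gives the chosen correction.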

correct

• example
– t = acress
– ĉ = acres (44%)
• despite all the math
• wrong result for
– was called a stellar and versatile acress
• what does Microsoft Word use?
– was called a stellar and versatile acress

Microsoft Word

corrected here

Google

Part 2

• Another algorithm using dynamic programming:
– Minimum Edit Distance
• Textbook: section 3.11
• File: eds.xls

Minimum Edit Distance

• general string comparison
• edit operations are insertion, deletion and substitution
• not limited to strings that are a single operation apart
• we can ask how different string a is from string b by the minimum edit distance

Minimum Edit Distance

• applications
– could be used for multi-typo correction
– used in Machine Translation Evaluation (MTEval)
– example
  • Source: 生産工程改善について
  • Translations:
  • (Standard) For improvement of the production process
  • (MT-A) About a production process betterment
  • (MT-B) About the production process improvement
• method
– compute edit distance between MT-A and Standard and MT-B and Standard in terms of word insertion/substitution etc.
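
The comparison can be tried out with a word-level edit distance. This is a rough Python sketch under stated assumptions: tokens are whitespace-separated words compared case-sensitively, every operation costs 1, and `word_edit_distance` is a name introduced here; real MT evaluation metrics are more elaborate.

```python
def word_edit_distance(source, target):
    """Minimum number of word insertions, deletions and substitutions (unit cost)."""
    rows, cols = len(source) + 1, len(target) + 1
    d = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        d[i][0] = i                      # delete every source word
    for j in range(cols):
        d[0][j] = j                      # insert every target word
    for i in range(1, rows):
        for j in range(1, cols):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,        # deletion
                          d[i][j - 1] + 1,        # insertion
                          d[i - 1][j - 1] + sub)  # substitution or match
    return d[-1][-1]


standard = "For improvement of the production process".split()
mt_a = "About a production process betterment".split()
mt_b = "About the production process improvement".split()

print(word_edit_distance(mt_a, standard))  # 5
print(word_edit_distance(mt_b, standard))  # 4
```

Under these assumptions MT-B is one edit closer to the Standard translation than MT-A.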

Minimum Edit Distance

• cost models
– Levenshtein
  • insertion, deletion and substitution all have unit cost
– Levenshtein (alternate)
  • insertion, deletion have unit cost
  • substitution is twice as expensive
  • substitution = one insert followed by one delete
– Typewriter
  • insertion, deletion and substitution all have unit cost
  • modified by key proximity
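
The two Levenshtein cost models differ only in the substitution cost, which can be made a parameter. Below is a minimal Python sketch; the function name and parameters are chosen here for illustration, and the typewriter model is left out because the slide does not say how key proximity changes the costs.

```python
def min_edit_distance(s, t, ins_cost=1, del_cost=1, sub_cost=1):
    """Minimum edit distance from s to t, keeping only two table rows in memory."""
    prev = [j * ins_cost for j in range(len(t) + 1)]   # row for the empty prefix of s
    for i in range(1, len(s) + 1):
        curr = [i * del_cost]                          # delete the first i letters of s
        for j in range(1, len(t) + 1):
            sub = 0 if s[i - 1] == t[j - 1] else sub_cost
            curr.append(min(prev[j] + del_cost,        # delete s[i-1]
                            curr[j - 1] + ins_cost,    # insert t[j-1]
                            prev[j - 1] + sub))        # substitute or match
        prev = curr
    return prev[-1]


print(min_edit_distance("small", "smsll"))              # 1 (Levenshtein)
print(min_edit_distance("small", "smsll", sub_cost=2))  # 2 (alternate)
```

The typo small→smsll is a single substitution, so it costs 1 under unit costs and 2 when a substitution is priced as an insert plus a delete.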

Minimum Edit Distance

• Dynamic Programming
– divide-and-conquer
  • to solve a problem we divide it into sub-problems
– sub-problems may be repeated
  • don’t want to re-solve a sub-problem the 2nd time around
– idea: put solutions to sub-problems in a table
  • and just look up the solution 2nd time around, thereby saving time
  • memoization

we’ll use a spreadsheet…
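
Outside a spreadsheet, the same table-lookup idea can be sketched with a memoized recursive function. This Python sketch uses the standard `functools.lru_cache` decorator as the table; the function name and the choice of substitution cost 2 are assumptions made for illustration.

```python
from functools import lru_cache


def edit_distance_memo(s, t, sub_cost=2):
    """Top-down minimum edit distance; returns (distance, number of cached sub-problems)."""

    @lru_cache(maxsize=None)
    def d(i, j):
        # Cost of transforming the first i letters of s into the first j letters of t.
        if i == 0:
            return j                     # insert j letters
        if j == 0:
            return i                     # delete i letters
        sub = 0 if s[i - 1] == t[j - 1] else sub_cost
        return min(d(i - 1, j) + 1,      # delete s[i-1]
                   d(i, j - 1) + 1,      # insert t[j-1]
                   d(i - 1, j - 1) + sub)

    distance = d(len(s), len(t))
    return distance, d.cache_info().currsize


distance, solved = edit_distance_memo("graffe", "giraffe")
print(distance)  # 1: insert the missing i
print(solved)    # at most (6+1) * (7+1) = 56 distinct sub-problems
```

Without the cache the recursion would revisit the same (i, j) pairs many times; with it, each pair is solved once and looked up afterwards.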

Minimum Edit Distance

• Consider a simple case: xy ⇄ yx
• Minimum # of operations:
  • insert and delete
  • cost = 2
• Minimum # of operations:
  • swap
  • cost = ?

Minimum Edit Distance

• Generally

Minimum Edit Distance

• Programming Practice: could be easily implemented in Perl

Minimum Edit Distance Computation

• Or in Microsoft Excel, file: eds.xls (on course webpage)

$ in a cell reference means “don’t change when copied from cell to cell”, e.g. in C$1, the row 1 stays the same; in $A3, the column A stays the same

Minimum Edit Distance

• Task: transform string s1..si into string t1..tj
– each sn and tn are letters
– string s is of length i, t is of length j
• Example:
– s = leader, t = adapter
– i = 6, j = 7
– Let’s say you’re allowed just three operations: (1) delete a letter, (2) insert a letter, or (3) substitute a letter for another letter
– What is one possible way to generate t from s?

Minimum Edit Distance

• Example:
– s = leader, t = adapter
– What is one possible way to generate t from s?
– leader ↕ adapter: delete l and e, keep ad, insert a, p and t, keep er
– cost is 2 deletes and 3 inserts, total 5 operations
– Question: is this the minimum possible?

• Simplest method: delete leader one letter at a time (leader, leade, lead, lea, le, l, empty), then type adapter one letter at a time (a, ad, ada, adap, adapt, adapte, adapter)
– cost: 13 operations

Minimum Edit Distance

      0   1   2   3   4   5   6   7
          a   d   a   p   t   e   r
0
1 l
2 e
3 a
4 d
5 e
6 r

• cell (2,3): cost of transforming le into ada
• cell (6,7): cost of transforming leader into adapter
• cell (3,0): cost of transforming lea into (empty)
• cell (0,4): cost of transforming (empty) into adap

Minimum Edit Distance

      0   1   2   3   4   5   6   7
          a   d   a   p   t   e   r
0
1 l
2 e
3 a
4 d
5 e                           k
6 r                               k

• cell (5,6): cost of transforming leade into adapte; call it k
• moving diagonally from (5,6) to (6,7) adds the final r to both strings; the letters match, so the cost stays at k

Minimum Edit Distance

      0   1   2   3   4   5   6   7
          a   d   a   p   t   e   r
0
1 l
2 e               k k+1
3 a
4 d
5 e
6 r

• cell (2,3): cost of transforming le into ada; call it k
• cell (2,4): cost of transforming le into adap
• moving right from (2,3) to (2,4) inserts p, so cell (2,4) can be reached at cost k+1

l e
a d a p

Minimum Edit Distance

      0   1   2   3   4   5   6   7
          a   d   a   p   t   e   r
0
1 l                   k
2 e                 k+1
3 a
4 d
5 e
6 r

• cell (1,4): cost of transforming l into adap; call it k
• moving down from (1,4) to (2,4) deletes e, so cell (2,4) can be reached at cost k+1

l e
a d a p

Minimum Edit Distance

      0   1   2   3   4   5   6   7
          a   d   a   p   t   e   r
0
1 l               k
2 e                 k+2
3 a
4 d
5 e
6 r

• cell (1,3): cost of transforming l into ada; call it k
• moving diagonally from (1,3) to (2,4) substitutes p for e, so cell (2,4) can be reached at cost k+2, assuming the cost of swapping e for p is 2

l e
a d a p

Minimum Edit Distance

            3 (a)    4 (p)
1 (l)       k1,3     k1,4
2 (e)       k2,3     ?

• cell (2,4): minimum of the three costs to get here in one step
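
Written out, the choice at cell (2,4) is a minimum over its three neighbouring cells. The block below is a sketch assuming insertion and deletion cost 1 and substitution cost 2 (the value assumed earlier for swapping e for p); D(i,j) is notation introduced here for the value in cell (i,j).

```latex
% Cell (2,4): come from above (delete e), from the left (insert p),
% or diagonally (substitute p for e).
D(2,4) = \min\bigl(\, k_{1,4} + 1,\;\; k_{2,3} + 1,\;\; k_{1,3} + 2 \,\bigr)

% The same rule for any interior cell, with s_i the i-th letter of s
% and t_j the j-th letter of t:
D(i,j) = \min\bigl(\, D(i-1,j) + 1,\;\; D(i,j-1) + 1,\;\; D(i-1,j-1) + c(s_i,t_j) \,\bigr),
\qquad
c(s_i,t_j) =
\begin{cases}
0 & \text{if } s_i = t_j \\
2 & \text{otherwise}
\end{cases}
```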

Minimum Edit Distance

• cell (3,0): cost of transforming lea into (empty)
• cost of le = cost of l, plus the cost of deleting the e

      0   1   2   3   4   5   6   7
          a   d   a   p   t   e   r
0     0
1 l   1
2 e   2
3 a   3
4 d   4
5 e   5
6 r   6

Minimum Edit Distance

      0   1   2   3   4   5   6   7
          a   d   a   p   t   e   r
0     0   1   2   3   4   5   6   7
1 l
2 e
3 a
4 d
5 e
6 r

Minimum Edit Distance

      0   1   2   3   4   5   6   7
          a   d   a   p   t   e   r
0     0   1   2   3   4   5   6   7
1 l   1   2
2 e   2
3 a   3
4 d   4
5 e   5
6 r   6

• cell (1,1) can be reached in one step from three cells: (0,1) above, (1,0) to the left, and (0,0) diagonally
• cell (1,1) = 2

Minimum Edit Distance

      0   1   2   3   4   5   6   7
          a   d   a   p   t   e   r
0     0   1   2   3   4   5   6   7
1 l   1
2 e   2
3 a   3
4 d   4                   5   6
5 e   5                   6   5
6 r   6

• cell (5,6) = 5: reached diagonally from cell (4,5) = 5 at no extra cost, since e matches e

Minimum Edit Distance

      0   1   2   3   4   5   6   7
          a   d   a   p   t   e   r
0     0   1   2   3   4   5   6   7
1 l   1
2 e   2               6   7
3 a   3               5   6
4 d   4
5 e   5
6 r   6

• cell (3,5) = 6: reached from cell (3,4) = 5 on its left by inserting t

Minimum Edit Distance

      0   1   2   3   4   5   6   7
          a   d   a   p   t   e   r
0     0   1   2   3   4   5   6   7
1 l   1
2 e   2
3 a   3
4 d   4
5 e   5                   6   5
6 r   6                   7   6

• cell (6,6) = 6: reached from cell (5,6) = 5 above by deleting r

Minimum Edit Distance
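
The whole table for leader and adapter can be filled in programmatically. The following Python sketch is not the eds.xls spreadsheet itself: it assumes insertion and deletion cost 1 and substitution cost 2, and `edit_distance_table` and `format_table` are names introduced here.

```python
def edit_distance_table(s, t, sub_cost=2):
    """Return the full (len(s)+1) x (len(t)+1) table of minimum edit costs."""
    d = [[0] * (len(t) + 1) for _ in range(len(s) + 1)]
    for i in range(1, len(s) + 1):
        d[i][0] = i                       # first column: delete i letters
    for j in range(1, len(t) + 1):
        d[0][j] = j                       # first row: insert j letters
    for i in range(1, len(s) + 1):
        for j in range(1, len(t) + 1):
            sub = 0 if s[i - 1] == t[j - 1] else sub_cost
            d[i][j] = min(d[i - 1][j] + 1,        # from above: delete
                          d[i][j - 1] + 1,        # from the left: insert
                          d[i - 1][j - 1] + sub)  # diagonal: substitute or match
    return d


def format_table(s, t, d):
    """Lay the table out like the slides: s down the side, t across the top."""
    lines = ["   " + "".join(f"{j:4d}" for j in range(len(t) + 1))]
    lines.append("   " + "    " + "".join(f"{c:>4}" for c in t))
    for i, row in enumerate(d):
        label = f"{i} {s[i - 1]}" if i else "0  "
        lines.append(label + "".join(f"{v:4d}" for v in row))
    return "\n".join(lines)


table = edit_distance_table("leader", "adapter")
print(format_table("leader", "adapter", table))
print(table[6][7])  # 5
```

Under these costs cell (6,7) is 5, so the earlier solution of 2 deletes and 3 inserts is a minimum. With unit-cost substitution (`sub_cost=1`) the same function gives 4.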