Dictionaries and tolerant retrieval

Most slides are from Prof. Schütze, Center for Information and Language Processing, University of Munich

September 24, 2018


Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex


Overview

1 Recap

2 Dictionaries

3 Wildcard queries

4 Edit distance

5 Spelling correction

6 Soundex

1 Recap

Type/token distinction

Token – an instance of a word or term occurring in a document
Type – an equivalence class of tokens
Example: In June, the dog likes to chase the cat in the barn.
12 word tokens, 9 word types
In(1) June(2) the(3) dog(4) likes(5) to(6) chase(7) [the] cat(8) [in] [the] barn(9).
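The count above can be reproduced with a short sketch. It assumes the simplest possible tokenizer (split on non-letters) and case folding as the only equivalence classing, so that "In" and "in" fall into one type; the function name tokens_and_types is made up for illustration.

```python
import re

def tokens_and_types(text):
    """Split on non-letters and case-fold; return (tokens, types)."""
    tokens = [t.lower() for t in re.findall(r"[A-Za-z]+", text)]
    return tokens, set(tokens)

tokens, types = tokens_and_types("In June, the dog likes to chase the cat in the barn.")
print(len(tokens), len(types))  # 12 9
```

A different tokenizer or a stemmer would define different equivalence classes and could change the number of types.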


Problems in tokenization

What are the delimiters? Space? Apostrophe? Hyphen?
For each of these: sometimes they delimit, sometimes they don’t.
No whitespace in many languages! (e.g., Chinese)
No whitespace in Dutch, German, Swedish compounds (Lebensversicherungsgesellschaftsangestellter)


Problems with equivalence classing

A term is an equivalence class of tokens.
How do we define equivalence classes?
Numbers (3/20/91 vs. 20/3/91)
Case folding
Stemming, Porter stemmer
Morphological analysis: inflectional vs. derivational
Equivalence classing problems in other languages
    More complex morphology than in English
    Finnish: a single verb may have 12,000 different forms
    Accents, umlauts


Skip pointers

[Figure: postings lists with skip pointers]
Brutus −→ 2 4 8 16 19 23 28 43 (skip pointers to 16, 28, 72)
Caesar −→ 1 2 3 5 8 41 51 60 71 (skip pointers to 5, 51, 98)


Positional indexes

Nonpositional index: each posting is just a docID
Positional index: each posting is a docID and a list of positions
Example query: “to₁ be₂ or₃ not₄ to₅ be₆”

to, 993427:
    ⟨ 1: ⟨7, 18, 33, 72, 86, 231⟩;
      2: ⟨1, 17, 74, 222, 255⟩;
      4: ⟨8, 16, 190, 429, 433⟩;
      5: ⟨363, 367⟩;
      7: ⟨13, 23, 191⟩; …⟩

be, 178239:
    ⟨ 1: ⟨17, 25⟩;
      4: ⟨17, 191, 291, 430, 434⟩;
      5: ⟨14, 19, 101⟩; …⟩

Document 4 is a match!
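As a rough illustration of why document 4 matches, the sketch below checks where "to" is immediately followed by "be". It uses only the postings shown on the slide (the elided entries are left out), and the names POSTINGS and phrase_match are hypothetical. Raising max_gap turns the same check into a simple proximity query.

```python
# Positional postings copied from the slide (only the entries shown there).
POSTINGS = {
    "to": {1: [7, 18, 33, 72, 86, 231], 2: [1, 17, 74, 222, 255],
           4: [8, 16, 190, 429, 433], 5: [363, 367], 7: [13, 23, 191]},
    "be": {1: [17, 25], 4: [17, 191, 291, 430, 434], 5: [14, 19, 101]},
}

def phrase_match(postings, first, second, max_gap=1):
    """Map docID -> positions of `first` that are followed by `second`
    within max_gap positions (max_gap=1 means directly adjacent)."""
    hits = {}
    common_docs = postings[first].keys() & postings[second].keys()
    for doc in sorted(common_docs):
        second_positions = set(postings[second][doc])
        starts = [p for p in postings[first][doc]
                  if any(p + gap in second_positions
                         for gap in range(1, max_gap + 1))]
        if starts:
            hits[doc] = starts
    return hits

print(phrase_match(POSTINGS, "to", "be"))  # {4: [16, 190, 429, 433]}
```

A full match of the six-word query would repeat this adjacency check for each consecutive pair of query terms.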


Positional indexes

With a positional index, we can answer phrase queries.
With a positional index, we can answer proximity queries.


Take-away

Tolerant retrieval: What to do if there is no exact match between query term and document term
Wildcard queries
Spelling correction

2 Dictionaries

Inverted index

For each term t, we store a list of all documents that contain t.

Brutus    −→ 1 2 4 11 31 45 173 174
Caesar    −→ 1 2 4 5 6 16 57 132 …
Calpurnia −→ 2 31 54 101
...
(left of the arrows: dictionary; right of the arrows: postings)


Dictionaries

The dictionary is the data structure for storing the term vocabulary.
Term vocabulary: the data
Dictionary: the data structure for storing the term vocabulary


Dictionary as array of fixed-width entries

For each term, we need to store a couple of items:
    document frequency
    pointer to postings list
    …
Assume for the time being that we can store this information in a fixed-length entry.
Assume that we store these entries in an array.


Dictionary as array of fixed-width entries

term      document frequency    pointer to postings list
a         656,265               −→
aachen    65                    −→
…         …                     …
zulu      221                   −→

space needed: 20 bytes (term), 4 bytes (document frequency), 4 bytes (pointer)

How do we look up a query term q_i in this array at query time?
That is: which data structure do we use to locate the entry (row) in the array where q_i is stored?


Data structures for looking up term

Two main classes of data structures: hashes and trees
Some IR systems use hashes, some use trees.
Criteria for when to use hashes vs. trees:
    Is there a fixed number of terms or will it keep growing?
    What are the relative frequencies with which various keys will be accessed?
    How many terms are we likely to have?


Hashes

Each vocabulary term is hashed into an integer, its row number in the array
At query time: hash query term, locate entry in fixed-width array
Pros: Lookup in a hash is faster than lookup in a tree.
    Lookup time is constant.
Cons
    no way to find minor variants (resume vs. résumé)
    no prefix search (all terms starting with automat)
    need to rehash everything periodically if vocabulary keeps growing


Trees

Trees solve the prefix problem (find all terms starting with automat).
Simplest tree: binary tree
Search is slightly slower than in hashes: O(log M), where M is the size of the vocabulary.
O(log M) only holds for balanced trees.
Rebalancing binary trees is expensive.
B-trees mitigate the rebalancing problem.
B-tree definition: every internal node has a number of children in the interval [a, b] where a, b are appropriate positive integers, e.g., [2, 4].

[Figure: binary tree]

B-tree

A generalization of the binary search tree
Allows nodes to have more than 2 children

3 Wildcard queries

Wildcard queries

mon*: find all docs containing any term beginning with mon
    Easy with B-tree dictionary: retrieve all terms t in the range: mon ≤ t < moo
*mon: find all docs containing any term ending with mon
    Maintain an additional tree for terms backwards
    Then retrieve all terms t in the range: nom ≤ t < non
Result: A set of terms that are matches for the wildcard query
Then retrieve documents that contain any of these terms
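The two range lookups can be sketched with a sorted Python list and binary search standing in for the B-tree (an assumption made to keep the example short; a real dictionary would use a tree). The vocabulary and the function names are invented for illustration, and the prefix is assumed to be non-empty.

```python
from bisect import bisect_left

def prefix_range(sorted_terms, prefix):
    """Return all terms t with prefix <= t < successor(prefix)."""
    # Successor of the prefix: bump its last character (mon -> moo).
    upper = prefix[:-1] + chr(ord(prefix[-1]) + 1)
    lo = bisect_left(sorted_terms, prefix)
    hi = bisect_left(sorted_terms, upper)
    return sorted_terms[lo:hi]

vocabulary = ["demon", "lemon", "monday", "money", "month", "moon", "salmon"]
forward = sorted(vocabulary)                      # stands in for the B-tree
backward = sorted(t[::-1] for t in vocabulary)    # additional "tree" of reversed terms

def prefix_query(prefix):
    """mon* -> terms in the range mon <= t < moo."""
    return prefix_range(forward, prefix)

def suffix_query(suffix):
    """*mon -> reversed terms in the range nom <= t < non, reversed back."""
    return sorted(t[::-1] for t in prefix_range(backward, suffix[::-1]))

print(prefix_query("mon"))  # ['monday', 'money', 'month']
print(suffix_query("mon"))  # ['demon', 'lemon', 'salmon']
```

Note that moon is correctly excluded from mon* because it sorts after moo.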


How to handle * in the middle of a term

Example: m*n
We could look up m* and *n in the B-tree and intersect the two term sets.
Expensive
Alternative: permuterm index
Basic idea: Rotate every wildcard query, so that the * occurs at the end.
Store every rotation of each term in the dictionary, say, in a B-tree


Permuterm index

For term hello: add the following to the B-tree, where $ is a special symbol:
hello$, ello$h, llo$he, lo$hel, o$hell, $hello


Permuterm index

For hello, we’ve stored: hello$, ello$h, llo$he, lo$hel, o$hell, $hello
Queries
    For X, look up X$
    For X*, look up $X*
    For *X, look up X$*
    For *X*, look up X*
    For X*Y, look up Y$X*
    Example: For hel*o, look up o$hel*
Permuterm index would better be called a permuterm tree.
But permuterm index is the more common name.
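The rotation rules can be tried out with a small sketch. It makes two simplifying assumptions: a sorted list of (rotation, term) pairs stands in for the B-tree, and queries contain at most one * (the *X* case is not handled). All function names and the vocabulary are hypothetical.

```python
from bisect import bisect_left

def rotations(term):
    """All rotations of term + '$': hello -> hello$, ello$h, ..., $hello."""
    s = term + "$"
    return [s[i:] + s[:i] for i in range(len(s))]

def build_permuterm(vocabulary):
    """Sorted (rotation, original term) pairs; stands in for the B-tree."""
    return sorted((rot, term) for term in vocabulary for rot in rotations(term))

def rotate_query(query):
    """Rotate a query with at most one '*' so that the '*' comes last."""
    if "*" not in query:
        return query + "$"           # X   -> X$   (exact lookup)
    left, right = query.split("*")   # X*Y -> Y$X  (used as a prefix)
    return right + "$" + left

def lookup(index, query):
    key = rotate_query(query)
    exact = "*" not in query
    start = bisect_left(index, (key,))   # first rotation >= key
    matches = set()
    for rot, term in index[start:]:
        if not rot.startswith(key):
            break                        # left the block sharing this prefix
        if not exact or rot == key:
            matches.add(term)
    return sorted(matches)

index = build_permuterm(["hello", "help", "halo", "hollow", "jello"])
print(lookup(index, "hel*o"))  # ['hello']
print(lookup(index, "h*o"))    # ['halo', 'hello']
```

Because every match shares the rotated key as a prefix, no post-filtering step is needed here.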


Processing a lookup in the permuterm index

Rotate query wildcard to the right
Use B-tree lookup as before
Problem: Permuterm more than quadruples the size of the dictionary compared to a regular B-tree. (empirical number)


k-gram indexes

More space-efficient than permuterm index
Enumerate all character k-grams (sequence of k characters) occurring in a term
2-grams are called bigrams.
    april → $a ap pr ri il l$
    April is the cruelest month → $a ap pr ri il l$ $i is s$ $t th he e$ $c cr ru ue el le es st t$ $m mo on nt th h$
$ is a special word boundary symbol.
Maintain an inverted index from bigrams to the terms that contain the bigram


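Why a k-gram index is more space-efficient

permuterm of hello → hello$, ello$h, llo$he, lo$hel, o$hell, $hello

2-grams of hello → $h, he, el, ll, lo, o$

Each rotation is a new dictionary entry as long as the term itself, whereas bigrams are short and shared by many terms.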


Postings list in a 3-gram inverted index

etr −→ beetroot −→ metric −→ petrify −→ retrieval


k-gram (bigram, trigram, …) indexes

Two different types of inverted indexes:
    Term-document inverted index: for finding documents based on a query consisting of terms
    k-gram index: for finding terms based on a query consisting of k-grams


Processing wildcarded terms in a bigram index

Query mon* can now be run as: $m AND mo AND on
Gets us all terms with the prefix mon …
… but also many “false positives” like moon.
We must post-filter these terms against query.
Surviving terms are then looked up in the term-document inverted index.
k-gram index vs. permuterm index
    k-gram index is more space efficient.
    Permuterm index doesn’t require post-filtering.

4 Edit distance

Edit distance

The edit distance between string s1 and string s2 is the minimum number of basic operations that convert s1 to s2.
Levenshtein distance: The admissible basic operations are
    insert, cost = 1
    delete, cost = 1
    replace, cost = 1
    copy, cost = 0

Levenshtein distance

s1    s2    operation        cost
dog   do    delete           1
cat   cart  insert           1
cat   cut   replace          1
cat   act   delete, insert   2


There are other distance definitions

Damerau-Levenshtein distance: includes a transposition operation.
cat → act: 1


Problem definition

For two strings
    X of length n
    Y of length m
D(i,j)
    the edit distance between X[1..i] and Y[1..j]
    score of the best alignment from X[1..i] to Y[1..j]
    i.e., the first i characters of X and the first j characters of Y
    The edit distance between X and Y is thus D(n,m)
Properties for D(i,j)
    D(i,0) = i; delete i letters
    D(0,j) = j; insert j letters


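Recurrence relation

\[
D(i,j) = \min
\begin{cases}
D(i-1,\, j-1) + d(X_i, Y_j) & \text{replace or copy} \\
D(i-1,\, j) + 1 & \text{delete} \\
D(i,\, j-1) + 1 & \text{insert}
\end{cases}
\tag{1}
\]

\[
d(x, y) =
\begin{cases}
0 & \text{if } x = y \\
1 & \text{otherwise}
\end{cases}
\tag{2}
\]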


Edit distance using dynamic programming

Dynamic programming: A tabular computation of D(n,m)
Solving problems by combining solutions to subproblems.
Bottom-up
    We compute D(i,j) for small i,j
    And compute larger D(i,j) based on previously computed smaller values
    i.e., compute D(i,j) for all i (0 < i ≤ n) and j (0 < j ≤ m)


Levenshtein distance: Computation

         f   a   s   t
     0   1   2   3   4
c    1   1   2   3   4
a    2   2   1   2   3
t    3   3   2   2   2
s    4   4   3   2   3


Levenshtein distance: Algorithm

LevenshteinDistance(s1, s2)
    for i ← 0 to |s1|
    do m[i, 0] = i
    for j ← 0 to |s2|
    do m[0, j] = j
    for i ← 1 to |s1|
    do for j ← 1 to |s2|
       do if s1[i] = s2[j]
          then m[i, j] = min{m[i-1, j]+1, m[i, j-1]+1, m[i-1, j-1]}
          else m[i, j] = min{m[i-1, j]+1, m[i, j-1]+1, m[i-1, j-1]+1}
    return m[|s1|, |s2|]

Operations: insert (cost 1), delete (cost 1), replace (cost 1), copy (cost 0)
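For readers who want to run the algorithm, here is a direct Python transcription of the pseudocode. The only changes are 0-based string indexing (s1[i - 1] instead of s1[i]) and folding the if/else into a single substitution cost; the function name is an assumption.

```python
def levenshtein(s1, s2):
    """Levenshtein distance via the bottom-up table m[i][j]."""
    rows, cols = len(s1) + 1, len(s2) + 1
    m = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        m[i][0] = i                      # delete i characters of s1
    for j in range(cols):
        m[0][j] = j                      # insert j characters of s2
    for i in range(1, rows):
        for j in range(1, cols):
            # copy costs 0, replace costs 1
            substitution = 0 if s1[i - 1] == s2[j - 1] else 1
            m[i][j] = min(m[i - 1][j] + 1,              # delete
                          m[i][j - 1] + 1,              # insert
                          m[i - 1][j - 1] + substitution)
    return m[rows - 1][cols - 1]

print(levenshtein("cats", "fast"))  # 3
```

The table needs O(|s1| · |s2|) time and space; keeping only the previous row would reduce the space to O(|s2|).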


Levenshtein distance: Example

Each cell holds four values: top row "diagonal  upper", bottom row "left  minimum".

           f       a       s       t
     0     1       2       3       4

c    1    1 2     2 3     3 4     4 5
          2 1     2 2     3 3     4 4

a    2    2 2     1 3     3 4     4 5
          3 2     3 1     2 2     3 3

t    3    3 3     3 2     2 3     2 4
          4 3     4 2     3 2     3 2

s    4    4 4     4 3     2 3     3 3
          5 4     5 3     4 2     3 3


Each cell of Levenshtein matrix

Upper left: cost of getting here from my upper left neighbor (copy or replace)
Upper right: cost of getting here from my upper neighbor (delete)
Lower left: cost of getting here from my left neighbor (insert)
Lower right: the minimum of the three possible “movements”; the cheapest way of getting here


Dynamic programming (Cormen et al.)

Optimal substructure: The optimal solution to the problem contains within it subsolutions, i.e., optimal solutions to subproblems.
Overlapping subsolutions: The subsolutions overlap. These subsolutions are computed over and over again when computing the global optimal solution in a brute-force algorithm.
Subproblem in the case of edit distance: what is the edit distance of two prefixes
Overlapping subsolutions: We need most distances of prefixes 3 times – this corresponds to moving right, diagonally, down.


Weighted edit distance

Weight of an operation depends on the characters involved.
Meant to capture keyboard errors, e.g., m more likely to be mistyped as n than as q.
Therefore, replacing m by n is a smaller edit distance than by q.
We now require a weight matrix as input.
Modify dynamic programming to handle weights
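One way the dynamic program might be modified is sketched below. The weight matrix is represented as a function sub_cost(a, b); the NEIGHBOURS set and the 0.5 weight for m/n are invented purely for illustration and are not taken from any real keyboard model.

```python
def weighted_edit_distance(s1, s2, sub_cost, ins_cost=1.0, del_cost=1.0):
    """Edit distance where the replace cost depends on the two characters."""
    rows, cols = len(s1) + 1, len(s2) + 1
    m = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        m[i][0] = m[i - 1][0] + del_cost
    for j in range(1, cols):
        m[0][j] = m[0][j - 1] + ins_cost
    for i in range(1, rows):
        for j in range(1, cols):
            a, b = s1[i - 1], s2[j - 1]
            m[i][j] = min(m[i - 1][j] + del_cost,
                          m[i][j - 1] + ins_cost,
                          m[i - 1][j - 1] + sub_cost(a, b))
    return m[-1][-1]

# Hypothetical weights: adjacent keys are cheaper to confuse.
NEIGHBOURS = {frozenset("mn")}

def keyboard_cost(a, b):
    if a == b:
        return 0.0                      # copy
    if frozenset((a, b)) in NEIGHBOURS:
        return 0.5                      # likely typo
    return 1.0                          # any other replacement

print(weighted_edit_distance("mat", "nat", keyboard_cost))  # 0.5
print(weighted_edit_distance("mat", "qat", keyboard_cost))  # 1.0
```

With a cost function that returns 0 for equal characters and 1 otherwise, this reduces to the plain Levenshtein distance.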


Using edit distance for spelling correction

Given query, first enumerate all character sequences within a preset (possibly weighted) edit distance
Intersect this set with our list of “correct” words
Then suggest terms in the intersection to the user.
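A minimal sketch of this enumerate-and-intersect procedure, assuming unweighted Levenshtein operations, a lowercase ASCII alphabet, and a toy list of correct words. The names edits1 and suggest are hypothetical.

```python
import string

def edits1(word, alphabet=string.ascii_lowercase):
    """All strings at Levenshtein distance exactly 1 from word."""
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = {left + right[1:] for left, right in splits if right}
    replaces = {left + c + right[1:] for left, right in splits if right for c in alphabet}
    inserts = {left + c + right for left, right in splits for c in alphabet}
    return (deletes | replaces | inserts) - {word}

def suggest(query, vocabulary, max_distance=1):
    """Enumerate strings within max_distance edits, then intersect with the vocabulary."""
    frontier, seen = {query}, {query}
    for _ in range(max_distance):
        frontier = {e for w in frontier for e in edits1(w)} - seen
        seen |= frontier
    return sorted((seen - {query}) & set(vocabulary))

correct_words = ["cat", "cart", "cut", "act", "dog"]
print(suggest("caat", correct_words))                  # ['cart', 'cat']
print(suggest("caat", correct_words, max_distance=2))  # ['act', 'cart', 'cat', 'cut']
```

The candidate set grows quickly with the distance bound, which is why small preset distances are typical for this approach.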


Exercise

1 Compute Levenshtein distance matrix for oslo – snow
2 What are the Levenshtein editing operations that transform cat into catcat?


s n o w

0 1 1 2 2 3 3 4 4

o 11

1 22 1

2 32 2

2 43 2

4 53 3

s 22

1 23 1

2 32 2

3 33 3

3 44 ?

l 33

o 44

68 / 115

Page 70: Dictionaries and tolerant retrievaljlu.myweb.cs.uwindsor.ca/538/538dict2018.pdf · Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex Overview 1 Recap 2

Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex

s n o w

0 1 1 2 2 3 3 4 4

o 11

1 22 1

2 32 2

2 43 2

4 53 3

s 22

1 23 1

2 32 2

3 33 3

3 44 3

l 33

o 44

69 / 115

Page 71: Dictionaries and tolerant retrievaljlu.myweb.cs.uwindsor.ca/538/538dict2018.pdf · Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex Overview 1 Recap 2

Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex

s n o w

0 1 1 2 2 3 3 4 4

o 11

1 22 1

2 32 2

2 43 2

4 53 3

s 22

1 23 1

2 32 2

3 33 3

3 44 3

l 33

3 24 ?

o 44

70 / 115

Page 72: Dictionaries and tolerant retrievaljlu.myweb.cs.uwindsor.ca/538/538dict2018.pdf · Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex Overview 1 Recap 2

Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex

s n o w

0 1 1 2 2 3 3 4 4

o 11

1 22 1

2 32 2

2 43 2

4 53 3

s 22

1 23 1

2 32 2

3 33 3

3 44 3

l 33

3 24 2

o 44

71 / 115

Page 73: Dictionaries and tolerant retrievaljlu.myweb.cs.uwindsor.ca/538/538dict2018.pdf · Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex Overview 1 Recap 2

Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex

s n o w

0 1 1 2 2 3 3 4 4

o 11

1 22 1

2 32 2

2 43 2

4 53 3

s 22

1 23 1

2 32 2

3 33 3

3 44 3

l 33

3 24 2

2 33 ?

o 44

72 / 115

Page 74: Dictionaries and tolerant retrievaljlu.myweb.cs.uwindsor.ca/538/538dict2018.pdf · Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex Overview 1 Recap 2

Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex

s n o w

0 1 1 2 2 3 3 4 4

o 11

1 22 1

2 32 2

2 43 2

4 53 3

s 22

1 23 1

2 32 2

3 33 3

3 44 3

l 33

3 24 2

2 33 2

o 44

73 / 115

Page 75: Dictionaries and tolerant retrievaljlu.myweb.cs.uwindsor.ca/538/538dict2018.pdf · Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex Overview 1 Recap 2

Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex

s n o w

0 1 1 2 2 3 3 4 4

o 11

1 22 1

2 32 2

2 43 2

4 53 3

s 22

1 23 1

2 32 2

3 33 3

3 44 3

l 33

3 24 2

2 33 2

3 43 ?

o 44

74 / 115

Page 76: Dictionaries and tolerant retrievaljlu.myweb.cs.uwindsor.ca/538/538dict2018.pdf · Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex Overview 1 Recap 2

Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex

s n o w

0 1 1 2 2 3 3 4 4

o 11

1 22 1

2 32 2

2 43 2

4 53 3

s 22

1 23 1

2 32 2

3 33 3

3 44 3

l 33

3 24 2

2 33 2

3 43 3

o 44

75 / 115

Page 77: Dictionaries and tolerant retrievaljlu.myweb.cs.uwindsor.ca/538/538dict2018.pdf · Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex Overview 1 Recap 2

Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex

s n o w

0 1 1 2 2 3 3 4 4

o 11

1 22 1

2 32 2

2 43 2

4 53 3

s 22

1 23 1

2 32 2

3 33 3

3 44 3

l 33

3 24 2

2 33 2

3 43 3

4 44 ?

o 44

76 / 115

Page 78: Dictionaries and tolerant retrievaljlu.myweb.cs.uwindsor.ca/538/538dict2018.pdf · Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex Overview 1 Recap 2

Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex

s n o w

0 1 1 2 2 3 3 4 4

o 11

1 22 1

2 32 2

2 43 2

4 53 3

s 22

1 23 1

2 32 2

3 33 3

3 44 3

l 33

3 24 2

2 33 2

3 43 3

4 44 4

o 44

77 / 115

Page 79: Dictionaries and tolerant retrievaljlu.myweb.cs.uwindsor.ca/538/538dict2018.pdf · Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex Overview 1 Recap 2

Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex

s n o w

0 1 1 2 2 3 3 4 4

o 11

1 22 1

2 32 2

2 43 2

4 53 3

s 22

1 23 1

2 32 2

3 33 3

3 44 3

l 33

3 24 2

2 33 2

3 43 3

4 44 4

o 44

4 35 ?

78 / 115

Page 80: Dictionaries and tolerant retrievaljlu.myweb.cs.uwindsor.ca/538/538dict2018.pdf · Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex Overview 1 Recap 2

Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex

s n o w

0 1 1 2 2 3 3 4 4

o 11

1 22 1

2 32 2

2 43 2

4 53 3

s 22

1 23 1

2 32 2

3 33 3

3 44 3

l 33

3 24 2

2 33 2

3 43 3

4 44 4

o 44

4 35 3

79 / 115

Page 81: Dictionaries and tolerant retrievaljlu.myweb.cs.uwindsor.ca/538/538dict2018.pdf · Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex Overview 1 Recap 2

Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex

s n o w

0 1 1 2 2 3 3 4 4

o 11

1 22 1

2 32 2

2 43 2

4 53 3

s 22

1 23 1

2 32 2

3 33 3

3 44 3

l 33

3 24 2

2 33 2

3 43 3

4 44 4

o 44

4 35 3

3 34 ?

80 / 115

Page 82: Dictionaries and tolerant retrievaljlu.myweb.cs.uwindsor.ca/538/538dict2018.pdf · Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex Overview 1 Recap 2

Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex

s n o w

0 1 1 2 2 3 3 4 4

o 11

1 22 1

2 32 2

2 43 2

4 53 3

s 22

1 23 1

2 32 2

3 33 3

3 44 3

l 33

3 24 2

2 33 2

3 43 3

4 44 4

o 44

4 35 3

3 34 3

81 / 115

Page 83: Dictionaries and tolerant retrievaljlu.myweb.cs.uwindsor.ca/538/538dict2018.pdf · Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex Overview 1 Recap 2

Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex

s n o w

0 1 1 2 2 3 3 4 4

o 11

1 22 1

2 32 2

2 43 2

4 53 3

s 22

1 23 1

2 32 2

3 33 3

3 44 3

l 33

3 24 2

2 33 2

3 43 3

4 44 4

o 44

4 35 3

3 34 3

2 44 ?

82 / 115

Page 84: Dictionaries and tolerant retrievaljlu.myweb.cs.uwindsor.ca/538/538dict2018.pdf · Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex Overview 1 Recap 2

Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex

s n o w

0 1 1 2 2 3 3 4 4

o 11

1 22 1

2 32 2

2 43 2

4 53 3

s 22

1 23 1

2 32 2

3 33 3

3 44 3

l 33

3 24 2

2 33 2

3 43 3

4 44 4

o 44

4 35 3

3 34 3

2 44 2

83 / 115

Page 85: Dictionaries and tolerant retrievaljlu.myweb.cs.uwindsor.ca/538/538dict2018.pdf · Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex Overview 1 Recap 2

Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex

s n o w

0 1 1 2 2 3 3 4 4

o 11

1 22 1

2 32 2

2 43 2

4 53 3

s 22

1 23 1

2 32 2

3 33 3

3 44 3

l 33

3 24 2

2 33 2

3 43 3

4 44 4

o 44

4 35 3

3 34 3

2 44 2

4 53 ?

84 / 115

Page 86: Dictionaries and tolerant retrievaljlu.myweb.cs.uwindsor.ca/538/538dict2018.pdf · Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex Overview 1 Recap 2

Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex

s n o w

0 1 1 2 2 3 3 4 4

o 11

1 22 1

2 32 2

2 43 2

4 53 3

s 22

1 23 1

2 32 2

3 33 3

3 44 3

l 33

3 24 2

2 33 2

3 43 3

4 44 4

o 44

4 35 3

3 34 3

2 44 2

4 53 3

85 / 115

Page 87: Dictionaries and tolerant retrievaljlu.myweb.cs.uwindsor.ca/538/538dict2018.pdf · Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex Overview 1 Recap 2

Recap Dictionaries Wildcard queries Edit distance Spelling correction Soundex

s n o w

0 1 1 2 2 3 3 4 4

o 11

1 22 1

2 32 2

2 43 2

4 53 3

s 22

1 23 1

2 32 2

3 33 3

3 44 3

l 33

3 24 2

2 33 2

3 43 3

4 44 4

o 44

4 35 3

3 34 3

2 44 2

4 53 3

86 / 115

Levenshtein matrix for oslo (input, rows) → snow (output, columns). Each cell holds four values: upper left = cost of arriving from the upper-left neighbor (copy or replace), upper right = cost of arriving from the upper neighbor (delete), lower left = cost of arriving from the left neighbor (insert), lower right = the minimum of the three.

              s        n        o        w
      0      1 1      2 2      3 3      4 4

o    1 1     1 2      2 3      2 4      4 5
             2 1      2 2      3 2      3 3

s    2 2     1 2      2 3      3 3      3 4
             3 1      2 2      3 3      4 3

l    3 3     3 2      2 3      3 4      4 4
             4 2      3 2      3 3      4 4

o    4 4     4 3      3 3      2 4      4 5
             5 3      4 3      4 2      3 3

How do I read out the editing operations that transform oslo into snow?

Reading the operations off the matrix (oslo → snow, total cost 3):

cost  operation  input  output
1     delete     o      *
0     (copy)     s      s
1     replace    l      n
0     (copy)     o      o
1     insert     *      w
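The read-out above can be sketched in Python. This is a minimal illustration, not the slides' own code: the function names levenshtein_matrix and read_operations are made up here, and the tie-breaking order in the backtrace (diagonal first, then insert, then delete) is an assumption. Other orders can yield other scripts of the same cost.

```python
def levenshtein_matrix(source, target):
    """Fill the (len(source)+1) x (len(target)+1) matrix of edit distances."""
    rows, cols = len(source) + 1, len(target) + 1
    m = [[0] * cols for _ in range(rows)]
    for i in range(1, rows):
        m[i][0] = i                      # delete i input characters
    for j in range(1, cols):
        m[0][j] = j                      # insert j output characters
    for i in range(1, rows):
        for j in range(1, cols):
            diag = m[i - 1][j - 1] + (0 if source[i - 1] == target[j - 1] else 1)
            up = m[i - 1][j] + 1         # delete source[i-1]
            left = m[i][j - 1] + 1       # insert target[j-1]
            m[i][j] = min(diag, up, left)
    return m


def read_operations(source, target):
    """Walk back from the bottom-right cell and collect one optimal edit script."""
    m = levenshtein_matrix(source, target)
    i, j = len(source), len(target)
    ops = []
    while i > 0 or j > 0:
        if i > 0 and j > 0:
            same = source[i - 1] == target[j - 1]
            cost = 0 if same else 1
            if m[i][j] == m[i - 1][j - 1] + cost:
                name = "copy" if same else "replace"
                ops.append((cost, name, source[i - 1], target[j - 1]))
                i, j = i - 1, j - 1
                continue
        if j > 0 and m[i][j] == m[i][j - 1] + 1:
            ops.append((1, "insert", "*", target[j - 1]))
            j -= 1
        else:
            ops.append((1, "delete", source[i - 1], "*"))
            i -= 1
    ops.reverse()                        # operations were collected back to front
    return ops


for cost, name, inp, out in read_operations("oslo", "snow"):
    print(cost, name, inp, out)
```

Each step of the backtrace moves to a neighbor whose value explains the current cell, so the collected costs always sum to the bottom-right entry of the matrix.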

Levenshtein matrix for cat (input, rows) → catcat (output, columns):

              c        a        t        c        a        t
      0      1 1      2 2      3 3      4 4      5 5      6 6

c    1 1     0 2      2 3      3 4      3 5      5 6      6 7
             2 0      1 1      2 2      3 3      4 4      5 5

a    2 2     2 1      0 2      2 3      3 4      3 5      5 6
             3 1      2 0      1 1      2 2      3 3      4 4

t    3 3     3 2      2 1      0 2      2 3      3 4      3 5
             4 2      3 1      2 0      1 1      2 2      3 3

The cat → catcat matrix can be read out in several ways, each of total cost 3:

cost  operation  input  output
1     insert     *      c
1     insert     *      a
1     insert     *      t
0     (copy)     c      c
0     (copy)     a      a
0     (copy)     t      t

cost  operation  input  output
0     (copy)     c      c
1     insert     *      a
1     insert     *      t
1     insert     *      c
0     (copy)     a      a
0     (copy)     t      t

cost  operation  input  output
0     (copy)     c      c
0     (copy)     a      a
1     insert     *      t
1     insert     *      c
1     insert     *      a
0     (copy)     t      t

cost  operation  input  output
0     (copy)     c      c
0     (copy)     a      a
0     (copy)     t      t
1     insert     *      c
1     insert     *      a
1     insert     *      t
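One way to check how many distinct optimal read-outs a matrix admits is to count minimum-cost paths alongside the distances. The sketch below is an illustration only; count_optimal_scripts is a hypothetical name, and it counts paths through the matrix (one per sequence of copy/replace/insert/delete moves), which is an assumption about what counts as a "different" script.

```python
def count_optimal_scripts(source, target):
    """Return (edit distance, number of minimum-cost paths through the matrix)."""
    rows, cols = len(source) + 1, len(target) + 1
    dist = [[0] * cols for _ in range(rows)]
    ways = [[0] * cols for _ in range(rows)]
    ways[0][0] = 1
    for i in range(1, rows):
        dist[i][0], ways[i][0] = i, 1    # only deletions along the first column
    for j in range(1, cols):
        dist[0][j], ways[0][j] = j, 1    # only insertions along the first row
    for i in range(1, rows):
        for j in range(1, cols):
            sub = 0 if source[i - 1] == target[j - 1] else 1
            moves = [
                (dist[i - 1][j - 1] + sub, ways[i - 1][j - 1]),  # copy or replace
                (dist[i - 1][j] + 1, ways[i - 1][j]),            # delete
                (dist[i][j - 1] + 1, ways[i][j - 1]),            # insert
            ]
            best = min(cost for cost, _ in moves)
            dist[i][j] = best
            # Add up the path counts of every neighbor that achieves the minimum.
            ways[i][j] = sum(w for cost, w in moves if cost == best)
    return dist[-1][-1], ways[-1][-1]


print(count_optimal_scripts("cat", "catcat"))
print(count_optimal_scripts("oslo", "snow"))
```

For cat → catcat this reports distance 3 with four optimal paths, matching the four operation tables; for oslo → snow it reports distance 3 with a single optimal path.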

Spelling correction

Two principal uses
  Correcting documents being indexed
  Correcting user queries

Two different methods for spelling correction
  Isolated word spelling correction
    Check each word on its own for misspelling
    Will not catch typos resulting in correctly spelled words, e.g., an asteroid that fell form the sky
  Context-sensitive spelling correction
    Look at surrounding words
    Can correct the form/from error above

Correcting documents

We’re not interested in interactive spelling correction of documents (e.g., MS Word) in this class.
In IR, we use document correction primarily for OCR’ed documents. (OCR = optical character recognition)
The general philosophy in IR is: don’t change the documents.

Correcting queries

First: isolated word spelling correction
Based on two assumptions:
  Premise 1: There is a list of “correct words” from which the correct spellings come.
  Premise 2: We have a way of computing the distance between a misspelled word and a correct word.
Simple spelling correction algorithm: return the “correct” word that has the smallest distance to the misspelled word.
  Example: informaton → information
For the list of correct words, we can use the vocabulary of all words that occur in our collection.
  Why is this problematic?
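The simple algorithm above can be sketched as follows. This is a minimal illustration under assumptions: the vocabulary VOCAB is a made-up toy list, the distance is plain Levenshtein distance, and closest_word scans the whole vocabulary linearly, which would be too slow for a real term vocabulary.

```python
def edit_distance(a, b):
    """Levenshtein distance, keeping only two rows of the matrix."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j - 1] + (ca != cb),   # copy or replace
                            prev[j] + 1,                # delete
                            curr[j - 1] + 1))           # insert
        prev = curr
    return prev[-1]


def closest_word(misspelled, vocabulary):
    """Return the vocabulary word with the smallest distance to the input."""
    return min(vocabulary, key=lambda w: edit_distance(misspelled, w))


# Hypothetical list of "correct words", for illustration only.
VOCAB = ["information", "informal", "retrieval", "formation"]
print(closest_word("informaton", VOCAB))
```

Ties are broken by vocabulary order here; a real corrector would more plausibly break them by term frequency.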

Alternatives to using the term vocabulary

A standard dictionary (Webster’s, OED, etc.)
An industry-specific dictionary (for specialized IR systems)
The term vocabulary of the collection, appropriately weighted

Distance between misspelled word and “correct” word

There are several alternatives:
  Edit distance and Levenshtein distance
  Weighted edit distance
  k-gram overlap

k-gram indexes for spelling correction

Enumerate all k-grams in the query term
  Example: bigram index, misspelled word bordroom
  Bigrams: bo, or, rd, dr, ro, oo, om
Use the k-gram index to retrieve “correct” words that match query term k-grams
Threshold by number of matching k-grams
  E.g., only vocabulary terms that differ by at most 3 k-grams
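The k-gram enumeration and the "differ by at most 3 k-grams" threshold can be sketched like this. The helper names kgrams and kgram_difference are made up for illustration, and measuring the difference as the size of the symmetric set difference is one possible reading of the threshold, not necessarily the one intended on the slide.

```python
def kgrams(term, k=2):
    """All overlapping character k-grams of term, in order of occurrence."""
    return [term[i:i + k] for i in range(len(term) - k + 1)]


def kgram_difference(a, b, k=2):
    """Number of distinct k-grams that occur in exactly one of the two terms."""
    return len(set(kgrams(a, k)) ^ set(kgrams(b, k)))


print(kgrams("bordroom"))
print(kgram_difference("bordroom", "boardroom"))
```

Under this reading, bordroom and boardroom differ in three bigrams (or on one side, oa and ar on the other), so boardroom would just pass a threshold of 3.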

k-gram indexes for spelling correction: bord

bo → aboard, about, boardroom, border
or → border, lord, morbid, sordid
rd → aboard, ardent, boardroom, border

BO ∩ OR ∩ RD = {border}
Terms matched twice: aboard, boardroom
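The lookup for bord can be sketched with a small in-memory index. The postings in BIGRAM_INDEX are copied from the slide and are only a sample; match_counts is a hypothetical helper name, and a real k-gram index would hold far longer postings lists.

```python
from collections import Counter

# Postings as shown on the slide; a real index would contain many more terms.
BIGRAM_INDEX = {
    "bo": ["aboard", "about", "boardroom", "border"],
    "or": ["border", "lord", "morbid", "sordid"],
    "rd": ["aboard", "ardent", "boardroom", "border"],
}


def match_counts(query, index, k=2):
    """Count, for each vocabulary term, how many distinct query k-grams it contains."""
    grams = {query[i:i + k] for i in range(len(query) - k + 1)}
    counts = Counter()
    for gram in grams:
        for term in index.get(gram, []):
            counts[term] += 1
    return counts


counts = match_counts("bord", BIGRAM_INDEX)
print(sorted(t for t, c in counts.items() if c == 3))   # terms in all three postings lists
print(sorted(t for t, c in counts.items() if c == 2))   # terms matched twice
```

Thresholding on the count, rather than requiring a full intersection, is what lets near misses such as aboard and boardroom survive as candidates.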

Context-sensitive spelling correction

Our example was: an asteroid that fell form the sky
How can we correct form here?
One idea: hit-based spelling correction
  Retrieve “correct” terms close to each query term
  For flew form munich: flea for flew, from for form, munch for munich
  Now try all possible resulting phrases as queries with one word “fixed” at a time
    Try query “flea form munich”
    Try query “flew from munich”
    Try query “flew form munch”
  The correct query “flew from munich” has the most hits.
Suppose we have 7 alternatives for flew, 20 for form and 3 for munich, how many “corrected” phrases will we enumerate?
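The one-word-at-a-time enumeration can be sketched as below. Everything search-related here is an assumption: FAKE_HITS holds made-up hit counts standing in for a real search engine call, and one_word_variants and best_by_hits are hypothetical helper names.

```python
def one_word_variants(query_terms, alternatives):
    """Phrases obtained by replacing exactly one query term with one of its alternatives."""
    variants = []
    for pos, term in enumerate(query_terms):
        for alt in alternatives.get(term, []):
            variants.append(query_terms[:pos] + [alt] + query_terms[pos + 1:])
    return variants


def best_by_hits(phrases, hit_count):
    """Pick the phrase for which hit_count (a stand-in for a search call) is largest."""
    return max(phrases, key=hit_count)


query = ["flew", "form", "munich"]
alternatives = {"flew": ["flea"], "form": ["from"], "munich": ["munch"]}
phrases = [" ".join(p) for p in one_word_variants(query, alternatives)]

# Made-up hit counts, for illustration only.
FAKE_HITS = {"flea form munich": 2, "flew from munich": 950, "flew form munch": 1}
best = best_by_hits(phrases, lambda p: FAKE_HITS.get(p, 0))
print(phrases)
print(best)
```

If the alternatives do not include the original words, changing one word at a time yields as many phrases as there are alternatives in total (7 + 20 + 3 = 30 for the lists in the question), whereas trying every combination of alternatives yields their product (7 × 20 × 3 = 420). Each phrase costs one query against the index, which is why the approach is expensive.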

Context-sensitive spelling correction (continued)

The “hit-based” algorithm we just outlined is not very efficient.
More efficient alternative: look at “collection” of queries, not documents

General issues in spelling correction

User interface
  Automatic vs. suggested correction
  “Did you mean” only works for one suggestion.
  What about multiple possible corrections?
  Tradeoff: simple vs. powerful UI
Cost
  Spelling correction is potentially expensive.
  Avoid running on every query?
  Maybe just on queries that match few documents.
  Guess: Spelling correction of major search engines is efficient enough to be run on every query.

Exercise: Understand Peter Norvig’s spelling corrector

import re, collections

def words(text):
    # Lowercase the text and extract runs of letters.
    return re.findall('[a-z]+', text.lower())

def train(features):
    # Word counts; every count starts from 1 rather than 0 (simple smoothing).
    model = collections.defaultdict(lambda: 1)
    for f in features:
        model[f] += 1
    return model

# big.txt is the training corpus from Norvig's page; it must be available locally.
with open('big.txt') as f:
    NWORDS = train(words(f.read()))

alphabet = 'abcdefghijklmnopqrstuvwxyz'

def edits1(word):
    # All strings one edit (delete, transpose, replace, insert) away from word.
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [a + b[1:] for a, b in splits if b]
    transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1]
    replaces = [a + c + b[1:] for a, b in splits for c in alphabet if b]
    inserts = [a + c + b for a, b in splits for c in alphabet]
    return set(deletes + transposes + replaces + inserts)

def known_edits2(word):
    # Known words two edits away from word.
    return set(e2 for e1 in edits1(word) for e2 in edits1(e1) if e2 in NWORDS)

def known(words):
    return set(w for w in words if w in NWORDS)

def correct(word):
    # Prefer the word itself, then known words one edit away, then two edits away.
    candidates = known([word]) or known(edits1(word)) or known_edits2(word) or [word]
    return max(candidates, key=NWORDS.get)

http://norvig.com/spell-correct.html

Soundex

Soundex is the basis for finding phonetic (as opposed to orthographic) alternatives.
  Example: chebyshev / tchebyscheff
Algorithm:
  Turn every token to be indexed into a 4-character reduced form
  Do the same with query terms
  Build and search an index on the reduced forms

Soundex algorithm

1 Retain the first letter of the term.
2 Change all occurrences of the following letters to ’0’ (zero): A, E, I, O, U, H, W, Y
3 Change letters to digits as follows:
    B, F, P, V to 1
    C, G, J, K, Q, S, X, Z to 2
    D, T to 3
    L to 4
    M, N to 5
    R to 6
4 Repeatedly remove one out of each pair of consecutive identical digits
5 Remove all zeros from the resulting string; pad the resulting string with trailing zeros and return the first four positions, which will consist of a letter followed by three digits
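The five steps can be sketched directly in Python. This follows the variant on the slide, in which the first letter is retained and not itself coded; other Soundex variants treat the first letter differently. The names DIGITS and soundex are chosen here for illustration, and the function assumes a non-empty alphabetic term.

```python
DIGITS = {}
for letters, digit in [("AEIOUHWY", "0"), ("BFPV", "1"), ("CGJKQSXZ", "2"),
                       ("DT", "3"), ("L", "4"), ("MN", "5"), ("R", "6")]:
    for letter in letters:
        DIGITS[letter] = digit


def soundex(term):
    """4-character reduced form of term, following the steps on the slide."""
    term = term.upper()
    first, rest = term[0], term[1:]                      # step 1: retain first letter
    coded = [DIGITS[ch] for ch in rest if ch in DIGITS]  # steps 2 and 3: letters to digits
    collapsed = []
    for d in coded:                                      # step 4: collapse runs of equal digits
        if not collapsed or collapsed[-1] != d:
            collapsed.append(d)
    digits = "".join(d for d in collapsed if d != "0")   # step 5: remove zeros
    return (first + digits + "000")[:4]                  # pad with zeros, keep four positions


print(soundex("HERMAN"))
print(soundex("HERMANN"))
```

Collapsing each run of identical digits in one pass gives the same result as repeatedly removing one digit from each identical pair.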

Example: Soundex of HERMAN

Retain H (retain the first letter of the term)
ERMAN → 0RM0N (A, E, I, O, U, H, W, Y → 0)
0RM0N → 06505 (change letters to digits: M, N to 5; R to 6)
06505 → 06505 (reduce consecutive identical digits)
06505 → 655 (remove zeros)
Return H655
Note: HERMANN will generate the same code: 065055 → 06505 → 655

How useful is Soundex?

Not very – for information retrieval
OK for “high recall” tasks in other applications (e.g., Interpol)
Zobel and Dart (1996) suggest better alternatives for phonetic matching in IR.

Recap

Fast access to the terms: hashing, B-tree
Tolerant retrieval: what to do if there is no exact match between query term and document term
  Wildcard queries
  Spelling correction: edit distance, dynamic programming algorithm
  k-gram index
  Soundex