text database

Text Database

• Text Fuzzy Match• Keyword/Key Phrase Extraction• Fuzzy Match Applications: Music Retrieval• Term Association Analysis• Text Categorization• Text Summarization, Clustering, and Its

Applications

Yuen-Hsien Tseng / 曾元顯Digital Media Center

National Taiwan Normal University

Values of This Course• Information access service is highly profitable

– Text database techniques are its core technologies

Re-UseRe-Use

AccessAccess

Re-OrganizeRe-Organize

CaptureCapture

CreateCreate

Share公司市值(亿)

微軟 2783

Google 1226

eBay 602

Yahoo 556

百度 15

By jiyanjiang on 2006-01-03

Text Fuzzy Match

• Applications– Text search, text highlight– duplicate detection, record linkage

• Levels of Text Fuzzy Match– Phonic: homonyms, transliteration match– Symbolic: most common type of text match– Semantic: knowledge bases are often needed

Text Fuzzy Match – Phonic (En)• For English, use Soundex code

– a hashing system for English words

– used by the United States Census Bureau

– by Robert C. Russell (U.S. patent 1,261,167 on April 2, 1918)

• Soundex Algorithm– begin with the first letter of a word followed by a three-digit (1-6) code that

represents the (first three) remaining consonants

– If the word is too short to generate 3 numbers, 0 is added as needed.

– If the generated code is longer than 3 numbers, the extra are thrown away.

Code Letters Description

1 B, F, P, V Labial

2 C, G, J, K, Q, S, X, Z Gutterals and sibilants

3 D, T Dental

4 L Long liquid

5 M, N Nasal

6 R Short liquid

SKIP A, E, H, I, O, U, W, Y Vowels (and H, W, and Y) are skipped

Text Fuzzy Match – Phonic (En)

• Examples:– perl -e "use Text::Soundex; print soundex('anything')"– Ans: A535

• References:– http://www.highprogrammer.com/alan/numbers/soundex.html

– http://searches.rootsweb.com/cgi-bin/Genea/soundex.sh

• Applications:– Yuen-Hsien Tseng, "Automatic Cataloguing and Searching for

Retrospective Data by Use of OCR Text", Journal of the American Society for Information Science and Technology, Vol. 52, No. 5, April 2001, pp. 378-390. (SSCI, SCI, EI)

Text Fuzzy Match – Phonic (Ch)• For Chinese,

– convert each character into its phonograms or pinyins– match their phonograms or pinyins instead of their characters

• Applications– Transliteration match: 克麗斯汀 vs 克莉思汀– Traditional/simplified character match

• Example:– D:\demo\ir\sam.lib>perl -s Fuzzyhighlight.pm " 國內大型電腦集團赴大陸

投資升溫 " " 國內電腦廠商赴大陸頭姿 " 1

– Ans: 國內大型電腦集團赴大陸投資 升溫– D:\demo\ir\sam.lib>perl -s Fuzzyhighlight.pm " 國內大型電腦集團赴大陸

投資升溫 " " 國內電腦廠商 , 赴大陸投資 , 頭姿 " 1

– Ans: 國內大型電腦集團 赴大陸 投資 升溫

Text Match (Naïve)

• Simple brute force (naive) algorithm: O(nm) in its worse case– Text: T[n], where n is the text length– Pattern: P[m], where m is the pattern length

• Naïve Algorithm:for (i=0; T[i] != '\0'; i++) { for (j=0; T[i+j] != '\0' && P[j] != '\0' && T[i+j]==P[j]; j++) ; if (P[j] == '\0') found a match}

• Example: looking for pattern "nano" in text "banananobano"– Each row represents an iteration of the outer loop, with each character in the

row representing the result of a comparison (X if the comparison was unequal)

0 1 2 3 4 5 6 7 8 9 10 11 T: b a n a n a n o b a n oi=0: Xi=1: Xi=2: n a n Xi=3: Xi=4: n a n oi=5: Xi=6: n Xi=7: Xi=8: Xi=9: n Xi=10: X

Text Match (Knuth-Morris-Pratt)

• Knuth-Morris-Pratt Algorithm– Skipping outer iterations in the above naïve algorithm

• If you've found a partial match of j characters starting at position i, you know what's in positions T[i]...T[i+j-1].

• i=2: n a n• i=3: n a n o <= no need to do this comparison• i=4: n a n o <= just skip to this step

– Skipping inner iterations• i=2: n a n• i=4: n a n o <= n has been compared at i=2, skip this n further

• Complexity: O(m+n) guaranteed• Reference:

– http://www.ics.uci.edu/~eppstein/161/960227.html

Fuzzy Text Match

• Become popular after Prof. Sun Wu ( 吳昇 ) published his papers since 1992 – complexity O(kn), k: number of allowed errors

• But his algorithm still is not good enough for music retrieval (as noted by Ian Witten in their papers about music retrieval)

• Reference:– Sun Wu and Udi Manber, “Agrep—A Fast Approximate Pattern-Matching

Tool” Usenix Winter 1992 - 被引用 140 次 – Sun Wu and Udi Manber, ‘‘Fast text searching: allowing errors,’’

Communications of the ACM, Vol. 35 , Issue 10 (Oct. 1992) pp: 83 - 91 – cited 71 times in ACM portal

– Baeza-Yates, R.A. and Gonnet, G.H. A new approach to text searching. Commun. ACM 35, 10 (1992). Preliminary version appeared in Proceedings of the 12th Annual ACM-SIGIR Conference on Information Retrieval, Cambridge, Mass. (June 1989), pp. 168-175.

Dynamic Programming for Fuzzy Match

• DP algorithm: 比較兩字串，允許字元的插入、刪除、代換• 將兩字串在二維矩陣中展開（如下頁範例），求累計距離• d[i, j] = min( d[i-1, j] + w(A[i], 0),

d[i-1, j-1] + w(A[i], B[j]), d[i, j-1] + w(0, B[j]) )

– d[i, j] : accumulated distance between sequence A and B– w(A[i], B[j]) : cost of substituting A[i]with B[j]– w(A[i], 0) : cost of inserting A[i]– w(0, B[j]) : cost of deleting B[j]– d[0, 0] = 0– d[i, 0] = d[i-1, 0] + w(A[i], 0) , 1<= i <= m– d[0, j] = d[0, j-1] + w(0, B[j]) , 1<= j <= n

• Complexity: O(n * m), m : length of A, n : length of B

( i, j )

(i-1, j)

(i, j-1)

(i-1,j-1)

Basic Dynamic Programming: Examples• 假設文件 B=adecdecf ，查詢詞 A=adec ，則 d[4,4] 為 0 ，最小，

從 d[4,4] 倒推回 d[3,3] ，再倒推回 d[2,2] ，再倒推回 d[1,1], 一路都有最小值 0, 可知 match 的位置在最開頭 adec

• 但假若 match 的字串 adec 在 B 的中間，就沒有預期的最小值 0

• 很多人都不知道如何改進，因此認為 DP 只能從頭開始比起，尤其是一些音樂檢索系統，只能允許旋律從頭比起，限制很大

• 改進辦法，在第一列插入全部都是 0 的列，如下頁

BA

a d e c d e c f

a 0 1 2 3 4 5 6 7d 1 0 1 2 3 4 5 6e 2 1 0 1 2 3 4 5c 3 2 1 0 1 2 3 4

BA

a c d a d e c f

a 0 1 2 3 4 5 6 7d 1 1 1 2 3 4 5 6e 2 2 2 2 3 3 4 5c 3 2 3 3 3 4 3 4

Improved Dynamic Programming• 在第一列插入全部都是 0 的列，其效果猶如允許從任何位置開始比較，亦即，初始值

– 從 d[0, j] = d[0, j-1] + w(0, B[j]) , 1<= j <= n– 改成 d[0, j] = 0 , 1<= j <= n

• 最後一列的值，範圍為 0 到 m ，定義一函數：其中 min 是最後一列的最小值

• 函數 d(m) 的範圍在 0 到 1 之間，其與 m 的關係如下圖，因此，取門檻值如下：if ($min == $lenq) { $dm = 0; } else { $dm = 1/exp($min/($lenq-$min)); }

if (($lenq < 5 && $dm < 0.7) or ($dm < 0.54)) { … } #The No Match threshold

Reference: Daniel Lopresti and Jiangying Zhou, “Retrieval Strategies for Noisy Text,” Proceedings of the Fifth Annual Symposium on Document Analysis and Information Retrieval, April 15-17, 1996, pp. 255-269.

B A

a c d a d e c f

0 0 0 0 0 0 0 0 a 0 1 1 0 1 1 1 1 d 1 1 1 1 0 1 2 2 e 2 2 2 2 2 0 1 2 c 3 3 3 3 3 1 0 1

minmin

exp

1)(

mmd

應用範例（ 1/3 ）• 目前我實做的 fuzzy highlight 規格：

– 中文、中文同音、英文、英文同音、中英文、中英文同音、多詞 highlight• 底下的 Perl 範例程式，第一個參數是待標示的文件（字串），第二個參

數是查詢詞（可多個，用「，」或「 , 」分開即可），第三個參數如果不為 0 , 則表示要做同音比對。

D:\>perl -s fuzzymatch5.pl " 國內大型電腦集團赴大陸投資升溫 " " 國內電腦廠商赴大陸投資 "

Result: 國內大型電腦集團赴大陸投資 升溫

D:\>perl -s fuzzymatch5.pl “ 國際商業股份公司” “傷葉古份” 1Result: 國際 商業股份 公司

D:\>perl -s fuzzymatch5.pl “ 這是 CD-R 光碟片” “ cdr 胱跌騙” 1Result: 這是 CD-R 光碟片 

D:\>perl -s fuzzymatch5.pl "My name is 'sam Tseng'" "sam Tseng"Result:My name is 'sam Tseng'

應用範例（ 2/3 ）• 中文同音、多詞 highlight ：

D:\>perl -s fuzzymatch5.pl " 國內大型電腦集團赴大陸投資升溫 " " 國內電腦廠商 , 赴大陸投資 , 生瘟 " 1

Result: 國內大型電腦集團 赴大陸投資 升溫 

D:\>perl -s fuzzymatch5.pl " 國內大型電腦集團赴大陸投資升溫 " " 國內電腦廠商 , 赴大陸投資 , 頭姿 " 1

Result: 國內大型電腦集團 赴大陸 投資 升溫

• 中英文模糊 highlight ：D:\>perl -s fuzzymatch5.pl " 國內 computer 集團赴大陸投資升溫 " " 國內

camputorization 集團 "Result: 國內 computer 集團 赴大陸投資升溫

應用範例（ 3/3 ）• 英文的同音 high light ：D:\>perl -s fuzzymatch5.pl " 國內 computer 集團赴大陸投資

升溫 " " 國內 camputorization"

Result: 國內 computer 集團赴大陸投資升溫

– 錯太多，無法 high light ，要改用同音比對

D:\>perl -s fuzzymatch5.pl " 國內 computer 集團赴大陸投資升溫 " " 國內 camputorization" 1

Result: 國內 computer 集團赴大陸投資升溫

Conclusions (1)• We have shown a fuzzy match tool that is capable of

– matching strings with some edit distance • regardless of spaces and capitalization• insensitive to plurals and tense• allowing homonym match

– converting the edit distance into a similarity score– highlighting similar query string in the text

• Google segments the query string into terms and highlight terms in the text

• Applications:– Music Retrieval

• Yuen-Hsien Tseng, "Content-Based Retrieval for Music Collections," ACM SIGIR '99, Aug. 15-19, Berkeley, U.S.A., 1999, pp.176-182.

– Duplicate detection• Yuen-Hsien Tseng and William John Teahan, "Verifying a Chinese Collection

for Text Categorization," ACM SIGIR '04, July 25 - 29, U.K., 2004, pp.556-557.

• 曾元顯 , " 分類不一致對文件自動分類效果的影響 ", 「大學圖書館」 , 第 9 卷第 1 期 , 2005 年 3 月 , 頁 2-19.

Conclusions (2)

• However, its complexity is– O(mn) in time and O(n) in space

– limited to applications of only small text sizes

• Do we have better solutions?– Yes, index the texts to be searched first and then match them

with the query string based on some retrieval models

– In this way, we also are able to deal with semantic text match to some degree

– In short, we need information retrieval (IR) techniques

text database

Documents