cis750 – seminar in advanced topics in computer science advanced topics in databases –...
DESCRIPTION
Problem - Motivation Eg., find documents containing “data”, “retrieval” Applications:TRANSCRIPT
![Page 1: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/1.jpg)
CIS750 – Seminar in Advanced Topics in Computer Science
Advanced topics in databases – Multimedia Databases
V. MegalooikonomouText Databases
(some slides are based on notes by C. Faloutsos)
![Page 2: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/2.jpg)
Text - Detailed outline text
problem full text scanning inversion signature files clustering information filtering and LSI
![Page 3: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/3.jpg)
Problem - Motivation Eg., find documents containing
“data”, “retrieval” Applications:
![Page 4: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/4.jpg)
Problem - Motivation Eg., find documents containing
“data”, “retrieval” Applications:
Web law + patent offices digital libraries information filtering
![Page 5: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/5.jpg)
Problem - Motivation Types of queries:
boolean (‘data’ AND ‘retrieval’ AND NOT ...)
![Page 6: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/6.jpg)
Problem - Motivation Types of queries:
boolean (‘data’ AND ‘retrieval’ AND NOT ...)
additional features (‘data’ ADJACENT ‘retrieval’)
keyword queries (‘data’, ‘retrieval’)
How to search a large collection of documents?
![Page 7: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/7.jpg)
Full-text scanning Build a FSA; scan
ca
t
![Page 8: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/8.jpg)
Full-text scanning for single term:
(naive: O(N*M))
ABRACADABRA text
CAB pattern
![Page 9: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/9.jpg)
Full-text scanning for single term:
(naive: O(N*M)) Knuth Morris and Pratt (‘77)
build a small FSA; visit every text letter once only, by carefully shifting more than one step
ABRACADABRA text
CAB pattern
![Page 10: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/10.jpg)
Full-text scanning
ABRACADABRA text
CAB pattern
CAB
CAB
CAB
...
![Page 11: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/11.jpg)
Full-text scanning for single term:
(naive: O(N*M)) Knuth Morris and Pratt (‘77) Boyer and Moore (‘77)
preprocess pattern; start from right to left & skip!
ABRACADABRA text
CAB pattern
![Page 12: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/12.jpg)
Full-text scanning
ABRACADABRA text
CAB pattern
CABCAB
CAB
![Page 13: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/13.jpg)
Full-text scanning
ABRACADABRA text
OMINOUS patternOMINOUS
Boyer+Moore: fastest, in practiceSunday (‘90): some improvements
![Page 14: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/14.jpg)
Full-text scanning For multiple terms (w/o “don’t care”
characters): Aho+Corasic (‘75) again, build a simplified FSA in O(M)
time Probabilistic algorithms:
‘fingerprints’ (Karp + Rabin ‘87) approximate match: ‘agrep’
[Wu+Manber, Baeza-Yates+, ‘92]
![Page 15: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/15.jpg)
Full-text scanning Approximate matching - string editing
distance: d( ‘survey’, ‘surgery’) = 2 = min # of insertions, deletions,
substitutions to transform the first string into the second SURVEY SURGERY
![Page 16: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/16.jpg)
Full-text scanning string editing distance - how to
compute? A:
![Page 17: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/17.jpg)
Full-text scanning string editing distance - how to
compute? A: dynamic programming cost( i, j ) = cost to match prefix
of length i of first string s with prefix of length j of second string t
![Page 18: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/18.jpg)
Full-text scanning
if s[i] = t[j] then cost( i, j ) = cost(i-1, j-1)else cost(i, j ) = min ( 1 + cost(i, j-1) // deletion 1 + cost(i-1, j-1) // substitution 1 + cost(i-1, j) // insertion )
![Page 19: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/19.jpg)
Full-text scanning
Complexity: O(M*N) (when using a matrix to ‘memoize’ partial results)
![Page 20: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/20.jpg)
Full-text scanning
Conclusions: Full text scanning needs no space
overhead, but is slow for large datasets
![Page 21: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/21.jpg)
Text - Detailed outline text
problem full text scanning inversion signature files clustering information filtering and LSI
![Page 22: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/22.jpg)
Text - Inversion
![Page 23: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/23.jpg)
Text - Inversion
Q: space overhead?
![Page 24: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/24.jpg)
Text - Inversion
A: mainly, the postings lists
![Page 25: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/25.jpg)
Text - Inversion
how to organize dictionary?
stemming – Y/N? insertions?
![Page 26: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/26.jpg)
Text - Inversion
how to organize dictionary? B-tree, hashing, TRIEs, PATRICIA
trees, ... stemming – Y/N? insertions?
![Page 27: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/27.jpg)
Text – Inversion
newer topics: Parallelism [Tomasic+,93] Insertions [Tomasic+94], [Brown+]
‘zipf’ distributions Approximate searching (‘glimpse’
[Wu+])
![Page 28: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/28.jpg)
postings list – more Zipf distr.: eg., rank-frequency plot of ‘Bible’
log(rank)
log(freq)
Text - Inversion
freq ~ 1 / (rank * ln(1.78V))
![Page 29: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/29.jpg)
Text - Inversion
postings lists Cutting+Pedersen
(keep first 4 in B-tree leaves) how to allocate space: [Faloutsos+92]
geometric progression compression (Elias codes) [Zobel+] –
down to 2% overhead!
![Page 30: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/30.jpg)
Conclusions
Conclusions: needs space overhead (2%-300%), but it is the fastest
![Page 31: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/31.jpg)
Text - Detailed outline text
problem full text scanning inversion signature files clustering information filtering and LSI
![Page 32: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/32.jpg)
Signature files
idea: ‘quick & dirty’ filter
![Page 33: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/33.jpg)
Signature files
idea: ‘quick & dirty’ filter then, do seq. scan on sign. file and
discard ‘false alarms’ Adv.: easy insertions; faster than seq.
scan Disadv.: O(N) search (with small
constant) Q: how to extract signatures?
![Page 34: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/34.jpg)
Signature files
A: superimposed coding!! [Mooers49], ...
m (=4 bits/word) ~ (=4 bits set to “1” and the rest left as “0”)F (=12 bits sign. size)the bit patterns are OR-ed to form the document signature
![Page 35: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/35.jpg)
Signature files
A: superimposed coding!! [Mooers49], ...
data
actual match
![Page 36: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/36.jpg)
Signature files
A: superimposed coding!! [Mooers49], ...
retrieval
actual dismissal
![Page 37: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/37.jpg)
Signature files
A: superimposed coding!! [Mooers49], ...
nucleotic
false alarm (‘false drop’)
![Page 38: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/38.jpg)
Signature files
A: superimposed coding!! [Mooers49], ...
‘YES’ is ‘MAYBE’ ‘NO’ is ‘NO’
![Page 39: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/39.jpg)
Signature files
Q1: How to choose F and m ? Q2: Why is it called ‘false drop’? Q3: other apps of signature files?
![Page 40: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/40.jpg)
Signature files
Q1: How to choose F and m ?
m (=4 bits/word)F (=12 bits sign. size)
![Page 41: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/41.jpg)
Signature files
Q1: How to choose F and m ? A: so that doc. signature is 50%
full
m (=4 bits/word)F (=12 bits sign. size)
![Page 42: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/42.jpg)
Signature files
Q1: How to choose F and m ? Q2: Why is it called ‘false drop’? Q3: other apps of signature files?
![Page 43: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/43.jpg)
Signature files
Q2: Why is it called ‘false drop’? Old, but fascinating story [1949]
how to find qualifying books (by title word, and/or author, and/or keyword)
in O(1) time? without computers
![Page 44: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/44.jpg)
Signature files
Solution: edge-notched cards
......
1 2 40
•each title word is mapped to m numbers(how?)•and the corresponding holes are cut out:
![Page 45: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/45.jpg)
Signature files
Solution: edge-notched cards
......
1 2 40
data
‘data’ -> #1, #39
![Page 46: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/46.jpg)
Signature files Search, e.g., for ‘data’: activate
needle #1, #39, and shake the stack of cards!
......
1 2 40
data
‘data’ -> #1, #39
![Page 47: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/47.jpg)
Signature files Also known as ‘zatocoding’, from
‘Zator’ company.
![Page 48: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/48.jpg)
Signature files
Q1: How to choose F and m ? Q2: Why is it called ‘false drop’? Q3: other apps of signature files?
![Page 49: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/49.jpg)
Signature files Q3: other apps of signature files? A: anything that has to do with
‘membership testing’: does ‘data’ belong to the set of words of the document?
![Page 50: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/50.jpg)
Signature files
UNIX’s early ‘spell’ system [McIlroy]
Bloom-joins in System R* [Mackert+] and ‘active disks’ [Riedel99]
differential files [Severance+Lohman]
![Page 51: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/51.jpg)
Signature files - conclusions
easy insertions; slower than inversion
brilliant idea of ‘quick and dirty’ filter: quickly discard the vast majority of non-qualifying elements, and focus on the rest.
![Page 52: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/52.jpg)
References
Aho, A. V. and M. J. Corasick (June 1975). "Fast Pattern Matching: An Aid to Bibliographic Search." CACM 18(6): 333-340.
Boyer, R. S. and J. S. Moore (Oct. 1977). "A Fast String Searching Algorithm." CACM 20(10): 762-772.
Brown, E. W., J. P. Callan, et al. (March 1994). Supporting Full-Text Information Retrieval with a Persistent Object Store. Proc. of EDBT conference, Cambridge, U.K., Springer Verlag.
![Page 53: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/53.jpg)
References - cont’d
Faloutsos, C. and H. V. Jagadish (Aug. 23-27, 1992). On B-tree Indices for Skewed Distributions. 18th VLDB Conference, Vancouver, British Columbia.
Karp, R. M. and M. O. Rabin (March 1987). "Efficient Randomized Pattern-Matching Algorithms." IBM Journal of Research and Development 31(2): 249-260.
Knuth, D. E., J. H. Morris, et al. (June 1977). "Fast Pattern Matching in Strings." SIAM J. Comput 6(2): 323-350.
![Page 54: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/54.jpg)
References - cont’d
Mackert, L. M. and G. M. Lohman (August 1986). R* Optimizer Validation and Performance Evaluation for Distributed Queries. Proc. of 12th Int. Conf. on Very Large Data Bases (VLDB), Kyoto, Japan.
Manber, U. and S. Wu (1994). GLIMPSE: A Tool to Search Through Entire File Systems. Proc. of USENIX Techn. Conf.
McIlroy, M. D. (Jan. 1982). "Development of a Spelling List." IEEE Trans. on Communications COM-30(1): 91-99.
![Page 55: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/55.jpg)
References - cont’d
Mooers, C. (1949). Application of Random Codes to the Gathering of Statistical Information
Bulletin 31. Cambridge, Mass, Zator Co. Pedersen, D. C. a. J. (1990). Optimizations for
dynamic inverted index maintenance. ACM SIGIR.
Riedel, E. (1999). Active Disks: Remote Execution for Network Attached Storage. ECE, CMU. Pittsburgh, PA.
![Page 56: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/56.jpg)
References - cont’d
Severance, D. G. and G. M. Lohman (Sept. 1976). "Differential Files: Their Application to the Maintenance of Large Databases." ACM TODS 1(3): 256-267.
Tomasic, A. and H. Garcia-Molina (1993). Performance of Inverted Indices in Distributed Text Document Retrieval Systems. PDIS.
Tomasic, A., H. Garcia-Molina, et al. (May 24-27, 1994). Incremental Updates of Inverted Lists for Text Document Retrieval. ACM SIGMOD, Minneapolis, MN.
![Page 57: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/57.jpg)
References - cont’d
Wu, S. and U. Manber (1992). "AGREP- A Fast Approximate Pattern-Matching Tool." .
Zobel, J., A. Moffat, et al. (Aug. 23-27, 1992). An Efficient Indexing Technique for Full-Text Database Systems. VLDB, Vancouver, B.C., Canada.
![Page 58: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/58.jpg)
Text - Detailed outline text
problem full text scanning inversion signature files clustering information filtering and LSI
![Page 59: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/59.jpg)
Vector Space Model and Clustering
keyword queries (vs Boolean) each document: -> vector (HOW?) each query: -> vector search for ‘similar’ vectors
![Page 60: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/60.jpg)
Vector Space Model and Clustering
main idea:
document
...data...
aaron zoodata
V (= vocabulary size)
‘indexing’
![Page 61: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/61.jpg)
Vector Space Model and Clustering
Then, group nearby vectors together Q1: cluster search? Q2: cluster generation?
Two significant contributions ranked output relevance feedback
![Page 62: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/62.jpg)
Vector Space Model and Clustering
cluster search: visit the (k) closest superclusters; continue recursively
CS TRs
TU TRs
![Page 63: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/63.jpg)
Vector Space Model and Clustering
ranked output: easy!
CS TRs
TU TRs
![Page 64: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/64.jpg)
Vector Space Model and Clustering
relevance feedback (brilliant idea) [Roccio’73]
CS TRs
TU TRs
![Page 65: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/65.jpg)
Vector Space Model and Clustering
relevance feedback (brilliant idea) [Roccio’73]
How?
CS TRs
TU TRs
![Page 66: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/66.jpg)
Vector Space Model and Clustering
How? A: by adding the ‘good’ vectors and subtracting the ‘bad’ ones
CS TRs
TU TRs
![Page 67: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/67.jpg)
Outline - detailed
main idea cluster search cluster generation evaluation
![Page 68: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/68.jpg)
Cluster generation Problem:
given N points in V dimensions, group them
![Page 69: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/69.jpg)
Cluster generation Problem:
given N points in V dimensions, group them
![Page 70: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/70.jpg)
Cluster generation
We need Q1: document-to-document
similarity Q2: document-to-cluster similarity
![Page 71: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/71.jpg)
Cluster generation
Q1: document-to-document similarity
(recall: ‘bag of words’ representation)
D1: {‘data’, ‘retrieval’, ‘system’} D2: {‘lung’, ‘pulmonary’, ‘system’} distance/similarity functions?
![Page 72: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/72.jpg)
Cluster generation
A1: # of words in commonA2: ........ normalized by the vocabulary
sizesA3: .... etc
About the same performance - prevailing one:
cosine similarity
![Page 73: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/73.jpg)
Cluster generation
cosine similarity: similarity(D1, D2) = cos(θ) = sum(v1,i * v2,i) / len(v1)/ len(v2)
θ
D1
D2
![Page 74: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/74.jpg)
Cluster generation
cosine similarity - observations: related to the Euclidean distance weights vi,j : according to tf/idf
θ
D1
D2
![Page 75: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/75.jpg)
Cluster generation
tf (‘term frequency’)high, if the term appears very often in
this document.idf (‘inverse document frequency’)
penalizes ‘common’ words, that appear in almost every document
![Page 76: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/76.jpg)
Cluster generation
We need Q1: document-to-document
similarity Q2: document-to-cluster similarity
?
![Page 77: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/77.jpg)
Cluster generation A1: min distance (‘single-link’) A2: max distance (‘all-link’) A3: avg distance A4: distance to centroid
?
![Page 78: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/78.jpg)
Cluster generation A1: min distance (‘single-link’)
leads to elongated clusters A2: max distance (‘all-link’)
many, small, tight clusters A3: avg distance
in between the above A4: distance to centroid
fast to compute
![Page 79: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/79.jpg)
Cluster generation
We have document-to-document similarity document-to-cluster similarity
Q: How to group documents into ‘natural’ clusters
![Page 80: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/80.jpg)
Cluster generation
A: *many-many* algorithms - in two groups [VanRijsbergen]:
theoretically sound (O(N^2)) independent of the insertion order
iterative (O(N), O(N log(N))
![Page 81: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/81.jpg)
Cluster generation - ‘sound’ methods
Approach#1: dendrograms - create a hierarchy (bottom up or top-down) - choose a cut-off (how?) and cut
cat tiger horse cow0.10.3
0.8
![Page 82: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/82.jpg)
Cluster generation - ‘sound’ methods
Approach#2: min. some statistical criterion (eg., sum of squares from cluster centers) like ‘k-means’ but how to decide ‘k’?
![Page 83: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/83.jpg)
Cluster generation - ‘sound’ methods
Approach#3: Graph theoretic [Zahn]: build MST; delete edges longer than 2.5* std of
the local average
![Page 84: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/84.jpg)
Cluster generation - ‘sound’ methods
Result:• variations
• Complexity?
![Page 85: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/85.jpg)
Cluster generation - ‘iterative’ methods
general outline: Choose ‘seeds’ (how?) assign each vector to its closest seed
(possibly adjusting cluster centroid) possibly, re-assign some vectors to
improve clustersFast and practical, but ‘unpredictable’
![Page 86: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/86.jpg)
Cluster generation - ‘iterative’ methods
general outline: Choose ‘seeds’ (how?) assign each vector to its closest seed
(possibly adjusting cluster centroid) possibly, re-assign some vectors to
improve clustersFast and practical, but ‘unpredictable’
![Page 87: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/87.jpg)
Cluster generation
one way to estimate # of clusters k: the ‘cover coefficient’ [Can+] ~ SVD
![Page 88: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/88.jpg)
Outline - detailed
main idea cluster search cluster generation evaluation
![Page 89: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/89.jpg)
Evaluation
Q: how to measure ‘goodness’ of one distance function vs another?
A: ground truth (by humans) and ‘precision’ and ‘recall’
![Page 90: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/90.jpg)
Evaluation
precision = (retrieved & relevant) / retrieved 100% precision -> no false alarms
recall = (retrieved & relevant)/ relevant 100% recall -> no false dismissals
![Page 91: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/91.jpg)
References
Can, F. and E. A. Ozkarahan (Dec. 1990). "Concepts and Effectiveness of the Cover-Coefficient-Based Clustering Methodology for Text Databases." ACM TODS 15(4): 483-517.
Noreault, T., M. McGill, et al. (1983). A Performance Evaluation of Similarity Measures, Document Term Weighting Schemes and Representation in a Boolean Environment. Information Retrieval Research, Butterworths.
Rocchio, J. J. (1971). Relevance Feedback in Information Retrieval. The SMART Retrieval System - Experiments in Automatic Document Processing. G. Salton. Englewood Cliffs, New Jersey, Prentice-Hall Inc.
![Page 92: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/92.jpg)
References - cont’d
Salton, G. (1971). The SMART Retrieval System - Experiments in Automatic Document Processing. Englewood Cliffs, New Jersey, Prentice-Hall Inc.
Salton, G. and M. J. McGill (1983). Introduction to Modern Information Retrieval, McGraw-Hill.
Van-Rijsbergen, C. J. (1979). Information Retrieval. London, England, Butterworths.
Zahn, C. T. (Jan. 1971). "Graph-Theoretical Methods for Detecting and Describing Gestalt Clusters." IEEE Trans. on Computers C-20(1): 68-86.
![Page 93: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/93.jpg)
Text - Detailed outline text
problem full text scanning inversion signature files clustering information filtering and LSI
![Page 94: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/94.jpg)
LSI - Detailed outline LSI
problem definition main idea experiments
![Page 95: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/95.jpg)
Information Filtering + LSI [Foltz+,’92] Goal:
users specify interests (= keywords) system alerts them, on suitable news-
documents Major contribution: LSI = Latent
Semantic Indexing latent (‘hidden’) concepts
![Page 96: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/96.jpg)
Information Filtering + LSI
Main idea map each document into some
‘concepts’ map each term into some ‘concepts’
‘Concept’:~ a set of terms, with weights, e.g. “data” (0.8), “system” (0.5), “retrieval”
(0.6) -> DBMS_concept
![Page 97: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/97.jpg)
Information Filtering + LSI
Pictorially: term-document matrix (BEFORE)
'data' 'system' 'retrieval' 'lung' 'ear'TR1 1 1 1TR2 1 1 1TR3 1 1TR4 1 1
![Page 98: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/98.jpg)
Information Filtering + LSI
Pictorially: concept-document matrix and...
'DBMS-concept'
'medical-concept'
TR1 1TR2 1TR3 1TR4 1
![Page 99: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/99.jpg)
Information Filtering + LSI
... and concept-term matrix'DBMS-concept'
'medical-concept'
data 1system 1retrieval 1lung 1ear 1
![Page 100: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/100.jpg)
Information Filtering + LSI
Q: How to search, eg., for ‘system’?
![Page 101: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/101.jpg)
Information Filtering + LSI
A: find the corresponding concept(s); and the corresponding documents
'DBMS-concept'
'medical-concept'
data 1system 1retrieval 1lung 1ear 1
'DBMS-concept'
'medical-concept'
TR1 1TR2 1TR3 1TR4 1
![Page 102: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/102.jpg)
Information Filtering + LSI
A: find the corresponding concept(s); and the corresponding documents
'DBMS-concept'
'medical-concept'
data 1system 1retrieval 1lung 1ear 1
'DBMS-concept'
'medical-concept'
TR1 1TR2 1TR3 1TR4 1
![Page 103: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/103.jpg)
Information Filtering + LSI
Thus it works like an (automatically constructed) thesaurus:
we may retrieve documents that DON’T have the term ‘system’, but they contain almost everything else (‘data’, ‘retrieval’)
![Page 104: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/104.jpg)
LSI - Detailed outline LSI
problem definition main idea experiments
![Page 105: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/105.jpg)
LSI - Experiments 150 Tech Memos (TM) / month 34 users submitted ‘profiles’ (6-66
words per profile) 100-300 concepts
![Page 106: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/106.jpg)
LSI - Experiments four methods, cross-product of:
vector-space or LSI, for similarity scoring
keywords or document-sample, for profile specification
measured: precision/recall
![Page 107: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/107.jpg)
LSI - Experiments LSI, with document-based profiles,
were better precision
recall
(0.25,0.65)
(0.50,0.45)
(0.75,0.30)
![Page 108: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/108.jpg)
LSI - Discussion - Conclusions
Great idea, to derive ‘concepts’ from documents to build a ‘statistical thesaurus’
automatically to reduce dimensionality
Often leads to better precision/recall but:
Needs ‘training’ set of documents ‘concept’ vectors are not sparse anymore
![Page 109: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/109.jpg)
LSI - Discussion - Conclusions
Observations Bellcore (-> Telcordia) has a
patent used for multi-lingual retrieval
How exactly SVD works?
![Page 110: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/110.jpg)
Indexing - Detailed outline primary key indexing secondary key / multi-key indexing spatial access methods fractals text SVD: a powerful tool multimedia ...
![Page 111: CIS750 – Seminar in Advanced Topics in Computer Science Advanced topics in databases – Multimedia Databases V. Megalooikonomou Text Databases (some slides](https://reader033.vdocuments.mx/reader033/viewer/2022061608/5a4d1acd7f8b9ab0599700c1/html5/thumbnails/111.jpg)
References
Foltz, P. W. and S. T. Dumais (Dec. 1992). "Personalized Information Delivery: An Analysis of Information Filtering Methods." Comm. of ACM (CACM) 35(12): 51-60.