multimedia databases text - part i slides by c. faloutsos, cmu

60
Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Post on 21-Dec-2015

256 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multimedia Databases

Text - part I

Slides by C. Faloutsos, CMU

Page 2: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 2

Outline

Goal: ‘Find similar / interesting things’

• Intro to DB

• Indexing - similarity search

• Data Mining

Page 3: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 3

Indexing - Detailed outline• primary key indexing• secondary key / multi-key indexing• spatial access methods• fractals• text• multimedia• ...

Page 4: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 4

Text - Detailed outline

• text– problem– full text scanning– inversion– signature files– clustering – information filtering and LSI

Page 5: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 5

Problem - Motivation

• Eg., find documents containing “data”, “retrieval”

• Applications:

Page 6: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 6

Problem - Motivation

• Eg., find documents containing “data”, “retrieval”

• Applications:– Web– law + patent offices– digital libraries– information filtering

Page 7: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 7

Problem - Motivation

• Types of queries:– boolean (‘data’ AND ‘retrieval’ AND NOT ...)

Page 8: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 8

Problem - Motivation

• Types of queries:– boolean (‘data’ AND ‘retrieval’ AND NOT ...)– additional features (‘data’ ADJACENT

‘retrieval’)– keyword queries (‘data’, ‘retrieval’)

• How to search a large collection of documents?

Page 9: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 9

Full-text scanning

• Build a FSA; scan

ca

t

Page 10: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 10

Full-text scanning

• for single term:– (naive: O(N*M))

ABRACADABRA text

CAB pattern

Page 11: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 11

Full-text scanning

• for single term:– (naive: O(N*M))– Knuth Morris and Pratt (‘77)

• build a small FSA; visit every text letter once only, by carefully shifting more than one step

ABRACADABRA text

CAB pattern

Page 12: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 12

Full-text scanning

ABRACADABRA text

CAB pattern

CAB

CAB

CAB

...

Page 13: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 13

Full-text scanning

• for single term:– (naive: O(N*M))– Knuth Morris and Pratt (‘77)– Boyer and Moore (‘77)

• preprocess pattern; start from right to left & skip!

ABRACADABRA text

CAB pattern

Page 14: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 14

Full-text scanning

ABRACADABRA text

CAB pattern

CAB

CAB

CAB

Page 15: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 15

Full-text scanning

ABRACADABRA text

OMINOUS pattern

OMINOUS

Boyer+Moore: fastest, in practiceSunday (‘90): some improvements

Page 16: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 16

Full-text scanning

• For multiple terms (w/o “don’t care” characters): Aho+Corasic (‘75)– again, build a simplified FSA in O(M) time

• Probabilistic algorithms: ‘fingerprints’ (Karp + Rabin ‘87)

• approximate match: ‘agrep’ [Wu+Manber, Baeza-Yates+, ‘92]

Page 17: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 17

Full-text scanning

• Approximate matching - string editing distance:

d( ‘survey’, ‘surgery’) = 2 = min # of insertions, deletions,

substitutions to transform the first string into the second SURVEY SURGERY

Page 18: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 18

Full-text scanning

• string editing distance - how to compute?• A:

Page 19: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 19

Full-text scanning

• string editing distance - how to compute?• A: dynamic programming cost( i, j ) = cost to match prefix of length

i of first string s with prefix of length j of second string t

Page 20: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 20

Full-text scanning

if s[i] = t[j] then cost( i, j ) = cost(i-1, j-1)else cost(i, j ) = min ( 1 + cost(i, j-1) // deletion 1 + cost(i-1, j-1) // substitution 1 + cost(i-1, j) // insertion )

Page 21: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 21

Full-text scanning

Complexity: O(M*N) (when using a matrix to ‘memoize’ partial results)

Page 22: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 22

Full-text scanning

Conclusions: • Full text scanning needs no space overhead,

but is slow for large datasets

Page 23: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 23

Text - Detailed outline

• text– problem– full text scanning– inversion– signature files– clustering – information filtering and LSI

Page 24: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 24

Text - Inversion

Page 25: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 25

Text - Inversion

Q: space overhead?

Page 26: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 26

Text - Inversion

A: mainly, the postings lists

Page 27: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 27

Text - Inversion

• how to organize dictionary?

• stemming – Y/N?

• insertions?

Page 28: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 28

Text - Inversion

• how to organize dictionary?– B-tree, hashing, TRIEs, PATRICIA trees, ...

• stemming – Y/N?

• insertions?

Page 29: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 29

Text – Inversion

• newer topics:

• Parallelism [Tomasic+,93]

• Insertions [Tomasic+94], [Brown+]– ‘zipf’ distributions

• Approximate searching (‘glimpse’ [Wu+])

Page 30: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 30

• postings list – more Zipf distr.: eg., rank-frequency plot of ‘Bible’

log(rank)

log(freq)

Text - Inversion

freq ~ 1/rank / ln(1.78V)

Page 31: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 31

Text - Inversion

• postings lists– Cutting+Pedersen

• (keep first 4 in B-tree leaves)

– how to allocate space: [Faloutsos+92]• geometric progression

– compression (Elias codes) [Zobel+] – down to 2% overhead!

Page 32: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 32

Conclusions

• Conclusions: needs space overhead (2%-300%), but it is the fastest

Page 33: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 33

Text - Detailed outline

• text– problem– full text scanning– inversion– signature files– clustering – information filtering and LSI

Page 34: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 34

Signature files

• idea: ‘quick & dirty’ filter

Page 35: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 35

Signature files

• idea: ‘quick & dirty’ filter

• then, do seq. scan on sign. file and discard ‘false alarms’

• Adv.: easy insertions; faster than seq. scan

• Disadv.: O(N) search (with small constant)

• Q: how to extract signatures?

Page 36: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 36

Signature files

• A: superimposed coding!! [Mooers49], ...

m (=4 bits/word)F (=12 bits sign. size)

Page 37: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 37

Signature files

• A: superimposed coding!! [Mooers49], ...

data

actual match

Page 38: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 38

Signature files

• A: superimposed coding!! [Mooers49], ...

retrieval

actual dismissal

Page 39: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 39

Signature files

• A: superimposed coding!! [Mooers49], ...

nucleotic

false alarm (‘false drop’)

Page 40: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 40

Signature files

• A: superimposed coding!! [Mooers49], ...

‘YES’ is ‘MAYBE’ ‘NO’ is ‘NO’

Page 41: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 41

Signature files

• Q1: How to choose F and m ?

• Q2: Why is it called ‘false drop’?

• Q3: other apps of signature files?

Page 42: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 42

Signature files

• Q1: How to choose F and m ?

m (=4 bits/word)F (=12 bits sign. size)

Page 43: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 43

Signature files

• Q1: How to choose F and m ?

• A: so that doc. signature is 50% full

m (=4 bits/word)F (=12 bits sign. size)

Page 44: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 44

Signature files

• Q1: How to choose F and m ?

• Q2: Why is it called ‘false drop’?

• Q3: other apps of signature files?

Page 45: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 45

Signature files

• Q2: Why is it called ‘false drop’?

• Old, but fascinating story [1949]– how to find qualifying books (by title word,

and/or author, and/or keyword)– in O(1) time? – without computers

Page 46: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 46

Signature files

• Solution: edge-notched cards

......

1 2 40

•each title word is mapped to m numbers(how?)•and the corresponding holes are cut out:

Page 47: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 47

Signature files

• Solution: edge-notched cards

......

1 2 40

data

‘data’ -> #1, #39

Page 48: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 48

Signature files• Search, e.g., for ‘data’: activate needle #1,

#39, and shake the stack of cards!

......

1 2 40

data

‘data’ -> #1, #39

Page 49: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 49

Signature files• Also known as ‘zatocoding’, from ‘Zator’

company.

Page 50: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 50

Signature files

• Q1: How to choose F and m ?

• Q2: Why is it called ‘false drop’?

• Q3: other apps of signature files?

Page 51: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 51

Signature files

• Q3: other apps of signature files?

• A: anything that has to do with ‘membership testing’: does ‘data’ belong to the set of words of the document?

Page 52: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 52

Signature files

• UNIX’s early ‘spell’ system [McIlroy]

• Bloom-joins in System R* [Mackert+] and ‘active disks’ [Riedel99]

• differential files [Severance+Lohman]

Page 53: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 53

Signature files - conclusions

• easy insertions; slower than inversion

• brilliant idea of ‘quick and dirty’ filter: quickly discard the vast majority of non-qualifying elements, and focus on the rest.

Page 54: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 54

References

• Aho, A. V. and M. J. Corasick (June 1975). "Fast Pattern Matching: An Aid to Bibliographic Search." CACM 18(6): 333-340.

• Boyer, R. S. and J. S. Moore (Oct. 1977). "A Fast String Searching Algorithm." CACM 20(10): 762-772.

• Brown, E. W., J. P. Callan, et al. (March 1994). Supporting Full-Text Information Retrieval with a Persistent Object Store. Proc. of EDBT conference, Cambridge, U.K., Springer Verlag.

Page 55: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 55

References - cont’d

• Faloutsos, C. and H. V. Jagadish (Aug. 23-27, 1992). On B-tree Indices for Skewed Distributions. 18th VLDB Conference, Vancouver, British Columbia.

• Karp, R. M. and M. O. Rabin (March 1987). "Efficient Randomized Pattern-Matching Algorithms." IBM Journal of Research and Development 31(2): 249-260.

• Knuth, D. E., J. H. Morris, et al. (June 1977). "Fast Pattern Matching in Strings." SIAM J. Comput 6(2): 323-350.

Page 56: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 56

References - cont’d

• Mackert, L. M. and G. M. Lohman (August 1986). R* Optimizer Validation and Performance Evaluation for Distributed Queries. Proc. of 12th Int. Conf. on Very Large Data Bases (VLDB), Kyoto, Japan.

• Manber, U. and S. Wu (1994). GLIMPSE: A Tool to Search Through Entire File Systems. Proc. of USENIX Techn. Conf.

• McIlroy, M. D. (Jan. 1982). "Development of a Spelling List." IEEE Trans. on Communications COM-30(1): 91-99.

Page 57: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 57

References - cont’d

• Mooers, C. (1949). Application of Random Codes to the Gathering of Statistical Information

• Bulletin 31. Cambridge, Mass, Zator Co.

• Pedersen, D. C. a. J. (1990). Optimizations for dynamic inverted index maintenance. ACM SIGIR.

• Riedel, E. (1999). Active Disks: Remote Execution for Network Attached Storage. ECE, CMU. Pittsburgh, PA.

Page 58: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 58

References - cont’d

• Severance, D. G. and G. M. Lohman (Sept. 1976). "Differential Files: Their Application to the Maintenance of Large Databases." ACM TODS 1(3): 256-267.

• Tomasic, A. and H. Garcia-Molina (1993). Performance of Inverted Indices in Distributed Text Document Retrieval Systems. PDIS.

• Tomasic, A., H. Garcia-Molina, et al. (May 24-27, 1994). Incremental Updates of Inverted Lists for Text Document Retrieval. ACM SIGMOD, Minneapolis, MN.

Page 59: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 59

References - cont’d

• Wu, S. and U. Manber (1992). "AGREP- A Fast Approximate Pattern-Matching Tool." .

• Zobel, J., A. Moffat, et al. (Aug. 23-27, 1992). An Efficient Indexing Technique for Full-Text Database Systems. VLDB, Vancouver, B.C., Canada.

Page 60: Multimedia Databases Text - part I Slides by C. Faloutsos, CMU

Multi DB and D.M. Copyright: C. Faloutsos (2001) 60