similarity measure

32
Chapter 3 Similarity Measures Data Mining Technology

Upload: zhao-sam

Post on 10-May-2015

17.961 views

Category:

Technology


4 download

DESCRIPTION

A similarity measure can represent the similarity between two documents, two queries, or one document and one query

TRANSCRIPT

Page 1: similarity measure

Chapter 3

Similarity Measures

Data Mining Technology

Page 2: similarity measure

Chapter 3

Similarity MeasuresWritten by Kevin E. Heinrich

Presented by Zhao Xinyou

[email protected]

2007.6.7

Some materials (Examples) are taken from Website.

Page 3: similarity measure

Searching Process

Input Text

Process

IndexQuery

Sorting

Show Text Result

Input Text

IndexQuery

Sorting

Show Text Result

Input Key Words

Results

Search

1. XXXXXXX2. YYYYYYY3. ZZZZZZZZ.......................

Page 4: similarity measure

Example

Page 5: similarity measure

Similarity Measures A similarity measure can represent the similar

ity between two documents, two queries, or one document and one query

It is possible to rank the retrieved documents in the order of presumed importance

A similarity measure is a function which computes the degree of similarity between a pair of text objects

There are a large number of similarity measures proposed in the literature, because the best similarity measure doesn't exist (yet!)

PP27-28

Page 6: similarity measure

Classic Similarity Measures

All similarity measures should map to the range [-1,1] or [0,1],

0 or -1 shows minimum similarity. (incompatible similarity)

1 shows maximum similarity. (absolute similarity)

PP28

Page 7: similarity measure

Conversion

For example 1 shows incompatible similarity, 10 shows

absolute similarity.

[1, 10]

[0, 1]

s’ = (s – 1 ) / 9

Generally, we may use:s’ = ( s – min_s ) / ( max_s – min_s )

LinearNon-linear

PP28

Page 8: similarity measure

Vector-Space Model-VSM

1960s Salton etc provided VSM, which has been successfully applied on SMART (a text searching system).

PP28-29

Page 9: similarity measure

Example D is a set, which contains m Web documents;

D={d1, d2,…di…dm} i=1,2…m

There are n words among m Web documents. di={wi1,wi2,…wij,…win} i=1,2…m , j=1,2,….n

Q= {q1,q2,…qi,….qn} i=1,2,….n

PP28-29

If similarity(q,di) > similarity(q, dj) We may get the result di is more relevant than dj

Page 10: similarity measure

Simple Measure Technology

Documents Set

PP29

Retrieved A

Relevant B

Retrieved and Relevant A∩B

Precision = Returned Relevant Documents / Total Returned Documents

Recall = Returned Relevant Documents / Total Relevant Documents

P(A,B) = |A∩B| / |A|

R(A,B) = |A∩B| / |B|

Page 11: similarity measure

Example--Simple Measure Technology

PP29

Documents Set

A,C,E,G,H, I, JRelevant

B,D,FRetrieved & Relevant

W,YRetrieved

|B| = {relevant} ={A,B,C,D,E,F,G,H,I,J} = 10

|A| = {retrieved} = {B, D, F,W,Y} = 5

|A∩B| = {relevant} ∩ {retrieved} ={B,D,F} = 3

P = precision = 3/5 = 60%

R = recall = 3/10 = 30%

Page 12: similarity measure

Precision-Recall Graph-Curves There is a tradeoff between Precision and Recall So measure Precision at different levels of Recall

PP29-30

One QueryTwo Queries

Difficult to determine which of these twohypothetical results is better

Page 13: similarity measure

Similarity measures based on VSM Dice coefficient Overlap Coefficient Jaccard Cosine Asymmetric Dissimilarity Other measures

PP30

Page 14: similarity measure

Dice Coefficient-Cont’ Definition of Harmonic Mean: To X1,X2, …, Xn ,their harmonic mean E equals n divided

by(1/x1+1/x2+…+1/Xn), that is

RP

E11

2

To Harmonic Mean (E) of Precision (P) and Recall (R)

BA

BA

B

BA

A

BA

22

PP30

n

iin x

n

xxx

nE

121

1111

Page 15: similarity measure

Dice Coefficient-Cont’

Denotation of Dice Coefficient:

)1,0()1(

)1(),(),(

1

2

1

2

1

n

k kj

n

k kq

n

k kjkq

j

ww

ww

BA

BABADdqsim

PP30

EBA

BA

BA

BABAthenDif

2

)1(),(

2

1

α>0.5 : precision is more important

α<0.5 : recall is more important

Usually α=0.5

Page 16: similarity measure

Overlap CoefficientPP30-31

Documents Set

A queries B Documents

n

k

n

k kjkq

n

k kjkq

j

ww

ww

BA

BABAOdqsim

1 1

22

1

),min(

),min(),(),(

Page 17: similarity measure

Jaccard Coefficient-Cont’

Documents Set

A queries B Documents

n

k

n

k kjkq

n

k kjkq

n

k kjkq

j

wwww

ww

BA

BABAJdqsim

1 11

22

1

),(),(

PP31

Page 18: similarity measure

Example- Jaccard Coefficient

D1 = 2T1 + 3T2 + 5T3, (2,3,5)

D2 = 3T1 + 7T2 + T3 , (3,7,1)

Q = 0T1 + 0T2 + 2T3, (0,0,2)

J(D1 , Q) = 10 / (38+4-10) = 10/32 = 0.31

J(D2 , Q) = 2 / (59+4-2) = 2/61 = 0.04

PP31

n

k

n

k kjkq

n

k kjkq

n

k kjkq

j

wwww

ww

dqsim

1 11

22

1

),(

Page 19: similarity measure

Cosine Coefficient-Cont’

n

k kj

n

k kq

n

k kjkq

j

j

j

ww

ww

dq

dq

BA

BAPRBACdqsim

1

2

1

2

1

),(),(

PP31-32

(d21,d22,…d2n)

(d11,d12,…d1n)(q1,q2,…qn)

Page 20: similarity measure

Example-Cosine Coefficient Q = 0T1 + 0T2 + 2T3 D1 = 2T1 + 3T2 + 5T3 D2 = 3T1 + 7T2 + T3

C (D1 , Q) =

=10/ √ (38*4) = 0.81

C (D2 , Q) = 2 / √ (59*4) = 0.13

)200()532(

523020222222

PP31-32

(3,7,1)

(2,3,5)(0,0,2)

Page 21: similarity measure

AsymmetricPP31

n

k kq

n

k kjkqjj

w

wwdqAdqsim

1

1),min(

),(),(

dj

diWki-->wkj

Page 22: similarity measure

Euclidean distance

n

kkjkqjEj wwdqddqdis

1

2)(),(),(

PP32

Page 23: similarity measure

Manhattan block distance

n

kkjkqjMj wwdqddqdis

1

),(),(

PP32

Page 24: similarity measure

Other Measures

We may use priori/context knowledge

For example: Sim(q,dj)= [content identifier similarity]+

[objective term similarity]+

[citation similarity]

PP32

Page 25: similarity measure

ComparisonPP34

Page 26: similarity measure

Comparison

),min(

*2

2

1

2

1

BA

BAO

BA

BAC

BA

BAJ

BA

BAD

BA

Simple matching

Dice’s Coefficient

Cosine Coefficient

Overlap Coefficient

Jaccard’s Coefficient

|A|+|B|-|A∩B| ≥(|A|+|B|)/2

|A| ≥ |A∩B||B| ≥ |A∩B|

(|A|+|B|)/2 ≥√(|A|*|B|)

√(|A|*|B|)≥ min (|A|, |B|)

O≥C≥D≥J

PP34

Page 27: similarity measure

Example-Documents-Term-Query-Cont’D1:A search Engine for 3D Models

D2:Design and Implementation of a string database query language

D3:Ranking of documents by measures considering conceptual dependence between terms

D4 Exploiting hierarchical domain structure to compute similarity

D5:an approach for measuring semantic similarity between words using multiple information sources

D6:determinging semantic similarity among entity classes from different ontologies

D7:strong similarity measures for ordered sets of documents in information retrieval

T1:search(ing) T2:Engine(s) T3:ModelsT4:database T5:query T6:languageT7:documents T8:measur(es,ing) T9:conceptual T10:dependence T11: domain T12:structure T13:similarity T14:semanticT15: ontologiesT16:informationT17: retrieval

Query: Semantic similarity measures used by search engines and other

information searching mechanisms

PP33

Page 28: similarity measure

Example-Term-Document Matrix-Cont’

T1 T2 T3 T4 T5 T6 T7 T8 T9 T10 T11 T12 13 T14 T15 T16 T17

D1 1 1 1

D2 1 1 1

D3 1 1 1 1

D4 1 1 1

D5 1 1 1 1

D6 1 1 1

D7 1 1 1 1 1

Q 2 1 1 1 1 1

Matrix[q][A]

PP34

Page 29: similarity measure

Dice coefficient

)0*0...1*11*1()0*0...1*12*2(

)0*01*0...0*01*01*11*2(*2

n

k k

n

k kq

n

k kkq

n

k k

n

k kq

n

k kkq

ww

ww

ww

wwqDD

1

211

2

1 1

1

211

2

1 1 2)2

1(

)1(),1(

5.012

6

39

)12(*2

PP30, PP34

Page 30: similarity measure

Final ResultsPP34

O≥C≥D≥J

Page 31: similarity measure

Current Applications

Multi-Dimensional Modeling Hierarchical Clustering Bioinformatics

PP35-38

Page 32: similarity measure

Discussion