Similarity and Dissimilarity

Lect 09/10-08-09 1 Similarity and Dissimilarity


Page 1: Similarity and Dissimilarity


Similarity and Dissimilarity

Page 2: Similarity and Dissimilarity

• Similarity: a numerical measure of the degree to which two objects are alike.

• The value of a similarity measure is high when the two objects are very similar.

• If the scale is [0,1], then 0 indicates no similarity and 1 indicates complete similarity.

Page 3: Similarity and Dissimilarity

• Dissimilarity: a numerical measure of the degree to which two objects are different. Dissimilarity is lower for a similar pair of objects.

• Transformations: suppose similarity between two objects ranges from 1 (not at all similar) to 10 (completely similar). It can be transformed into the range [0,1] as follows:

• S' = (S - 1)/9, where S and S' are the original and transformed similarity values.

Page 4: Similarity and Dissimilarity

• This can be generalized as: S' = (s - min_s)/(max_s - min_s).

• Similarly, dissimilarity can be mapped onto the interval [0,1]: D' = (d - min_d)/(max_d - min_d).
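The two formulas above are the same min-max transformation; a minimal Python sketch (the helper name `rescale` is my own):

```python
def rescale(value, lo, hi):
    """Map a similarity or dissimilarity score from [lo, hi] onto [0, 1]."""
    return (value - lo) / (hi - lo)

# The 1-to-10 similarity scale from the previous slide: S' = (S - 1)/9.
print(rescale(10, 1, 10))  # completely similar -> 1.0
print(rescale(1, 1, 10))   # not at all similar -> 0.0
```

The same function covers both cases, since S' = (s - min_s)/(max_s - min_s) and D' = (d - min_d)/(max_d - min_d) differ only in which scores are plugged in.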

Page 5: Similarity and Dissimilarity

Similarity and Dissimilarity Between Simple Attributes

• Consider objects having a single attribute.

• One nominal attribute: in this context, similarity means only whether the two objects have the same value or not.

S = 1 if the attribute values match
S = 0 if the attribute values do not match

Dissimilarity can be defined similarly, in the opposite way.

Page 6: Similarity and Dissimilarity

• One ordinal attribute: order is important and should be taken into account.

• Suppose an attribute measures the quality of a product as {poor, fair, OK, good, wonderful}, with Product P1 = wonderful, Product P2 = good, P3 = OK, and so on.

• To make this observation quantitative, map the values to integers: {poor = 0, fair = 1, OK = 2, good = 3, wonderful = 4}.

• Then d(P1,P2) = 4 - 3 = 1, or, normalized by the number of levels minus one, d(P1,P2) = (4 - 3)/4 = 0.25.

• A similarity for ordinal attributes can be defined as s = 1 - d.
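The ordinal recipe above can be sketched in Python (the function names are mine; the SCALE mapping is the one from the slide):

```python
# Map the ordinal quality scale to integers, as on the slide.
SCALE = {"poor": 0, "fair": 1, "OK": 2, "good": 3, "wonderful": 4}

def ordinal_dissimilarity(a, b, scale=SCALE):
    """Normalized dissimilarity: |rank(a) - rank(b)| / (number of levels - 1)."""
    return abs(scale[a] - scale[b]) / (len(scale) - 1)

def ordinal_similarity(a, b, scale=SCALE):
    """Similarity defined as s = 1 - d."""
    return 1 - ordinal_dissimilarity(a, b, scale)

print(ordinal_dissimilarity("wonderful", "good"))  # (4 - 3)/4 = 0.25
print(ordinal_similarity("wonderful", "good"))     # 1 - 0.25 = 0.75
```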

Page 7: Similarity and Dissimilarity

• For interval or ratio attributes, the dissimilarity between two objects is the absolute difference between their attribute values.

Page 8: Similarity and Dissimilarity


Similarity/Dissimilarity for Simple Attributes

p and q are the attribute values for two data objects.

Page 9: Similarity and Dissimilarity

Dissimilarity between data objects

• Distances: the Euclidean distance between two data objects (points) x and y is

d(x, y) = sqrt( sum_{k=1}^{n} (x_k - y_k)^2 )

where n is the number of dimensions and x_k and y_k are the kth attributes of x and y.

Page 10: Similarity and Dissimilarity


Euclidean Distance

[Figure: the four points p1(0,2), p2(2,0), p3(3,1), and p4(5,1) plotted in the plane.]

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

Distance Matrix

        p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0
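The distance matrix can be reproduced directly from the Euclidean formula; a short Python sketch (the helper name `euclidean` is mine):

```python
from math import sqrt

def euclidean(x, y):
    """Euclidean distance: square root of the sum of squared attribute differences."""
    return sqrt(sum((xk - yk) ** 2 for xk, yk in zip(x, y)))

points = {"p1": (0, 2), "p2": (2, 0), "p3": (3, 1), "p4": (5, 1)}
for a in points:
    # Each row of the distance matrix, rounded to three decimals.
    row = [round(euclidean(points[a], points[b]), 3) for b in points]
    print(a, row)
```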

Page 11: Similarity and Dissimilarity

Minkowski Distance

• The Minkowski distance is a generalization of the Euclidean distance:

dist(p, q) = ( sum_{k=1}^{n} |p_k - q_k|^r )^{1/r}

where r is a parameter, n is the number of dimensions (attributes), and p_k and q_k are, respectively, the kth attributes (components) of data objects p and q.

Page 12: Similarity and Dissimilarity

Minkowski Distance: Examples

• r = 1. City block (Manhattan, taxicab, L1 norm) distance. A common example is the Hamming distance, which is just the number of bits that differ between two binary vectors.

• r = 2. Euclidean (L2 norm) distance.

• r -> infinity. "Supremum" (Lmax or L-infinity norm) distance. This is the maximum difference between any component of the two vectors.

• Do not confuse r with n: all of these distances are defined for all numbers of dimensions.

Page 13: Similarity and Dissimilarity


Minkowski Distance

Distance Matrix

point   x   y
p1      0   2
p2      2   0
p3      3   1
p4      5   1

L1      p1   p2   p3   p4
p1      0    4    4    6
p2      4    0    2    4
p3      4    2    0    2
p4      6    4    2    0

L2      p1      p2      p3      p4
p1      0       2.828   3.162   5.099
p2      2.828   0       1.414   3.162
p3      3.162   1.414   0       2
p4      5.099   3.162   2       0

L-inf   p1   p2   p3   p4
p1      0    2    3    5
p2      2    0    1    3
p3      3    1    0    2
p4      5    3    2    0
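The three matrices correspond to r = 1, r = 2, and r -> infinity; a Python sketch spot-checking the (p1, p4) entries (the function names `minkowski` and `supremum` are mine):

```python
def minkowski(p, q, r):
    """Minkowski distance; r = 1 gives L1 (city block), r = 2 gives L2 (Euclidean)."""
    return sum(abs(pk - qk) ** r for pk, qk in zip(p, q)) ** (1 / r)

def supremum(p, q):
    """Limit r -> infinity: the maximum per-attribute difference (L-infinity norm)."""
    return max(abs(pk - qk) for pk, qk in zip(p, q))

p1, p4 = (0, 2), (5, 1)
print(minkowski(p1, p4, 1))            # L1: |0-5| + |2-1| = 6.0
print(round(minkowski(p1, p4, 2), 3))  # L2: sqrt(26) = 5.099
print(supremum(p1, p4))                # L-inf: max(5, 1) = 5
```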

Page 14: Similarity and Dissimilarity

Common Properties of a Distance

• Distances, such as the Euclidean distance, have some well-known properties:

1. d(p, q) >= 0 for all p and q, and d(p, q) = 0 only if p = q. (Positive definiteness)
2. d(p, q) = d(q, p) for all p and q. (Symmetry)
3. d(p, r) <= d(p, q) + d(q, r) for all points p, q, and r. (Triangle inequality)

where d(p, q) is the distance (dissimilarity) between points (data objects) p and q.

• A distance that satisfies these properties is a metric.

Page 15: Similarity and Dissimilarity

Example of a Non-Metric Dissimilarity

• Two sets A = {1,2,3,4} and B = {2,3,4}; A - B = {1}, B - A = { }.

• Define the distance d between A and B as d(A,B) = size(A - B), where size is a function returning the number of elements in a set.

• This distance measure does not satisfy the positivity property (d(B,A) = 0 although A != B), the symmetry property (d(A,B) = 1 but d(B,A) = 0), or the triangle inequality.

• These properties can hold if the dissimilarity measure is modified as follows: D(A,B) = size(A - B) + size(B - A).
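The set example, sketched in Python (the function names d and D follow the slide's notation):

```python
def d(A, B):
    """The slide's non-metric dissimilarity: size of the set difference A - B."""
    return len(A - B)

def D(A, B):
    """The modified measure: size(A - B) + size(B - A), the symmetric difference."""
    return len(A - B) + len(B - A)

A, B = {1, 2, 3, 4}, {2, 3, 4}
print(d(A, B), d(B, A))  # 1 0 -> not symmetric, and d(B, A) = 0 although A != B
print(D(A, B), D(B, A))  # 1 1 -> symmetric
```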

Page 16: Similarity and Dissimilarity

Common Properties of a Similarity

• Similarities also have some well-known properties:

1. s(p, q) = 1 (or maximum similarity) only if p = q.

2. s(p, q) = s(q, p) for all p and q. (Symmetry)

where s(p, q) is the similarity between points (data objects), p and q.

Page 17: Similarity and Dissimilarity

Similarity Between Binary Vectors

• A common situation is that objects p and q have only n binary attributes.

• p and q are then binary vectors, which leads to the following four quantities:

M01 = the number of attributes where p is 0 and q is 1
M10 = the number of attributes where p is 1 and q is 0
M00 = the number of attributes where p is 0 and q is 0
M11 = the number of attributes where p is 1 and q is 1

• Simple Matching Coefficient (a similarity coefficient):

SMC = number of matches / number of attributes
    = (M11 + M00) / (M01 + M10 + M11 + M00)

Page 18: Similarity and Dissimilarity

SMC versus Jaccard: Example

J = number of 11 matches / number of not-both-zero attributes
  = M11 / (M01 + M10 + M11)

Example: p = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0), q = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)

M01 = 2 (the number of attributes where p is 0 and q is 1)
M10 = 1 (the number of attributes where p is 1 and q is 0)
M00 = 7 (the number of attributes where p is 0 and q is 0)
M11 = 0 (the number of attributes where p is 1 and q is 1)

SMC = (M11 + M00) / (M01 + M10 + M11 + M00) = (0 + 7) / (2 + 1 + 0 + 7) = 0.7

J = M11 / (M01 + M10 + M11) = 0 / (2 + 1 + 0) = 0
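The worked example can be checked with a short Python sketch (the helper names `binary_counts`, `smc`, and `jaccard` are mine):

```python
def binary_counts(p, q):
    """Count the four match/mismatch quantities for two binary vectors."""
    m01 = sum(1 for a, b in zip(p, q) if a == 0 and b == 1)
    m10 = sum(1 for a, b in zip(p, q) if a == 1 and b == 0)
    m00 = sum(1 for a, b in zip(p, q) if a == 0 and b == 0)
    m11 = sum(1 for a, b in zip(p, q) if a == 1 and b == 1)
    return m01, m10, m00, m11

def smc(p, q):
    """Simple Matching Coefficient: matches over all attributes."""
    m01, m10, m00, m11 = binary_counts(p, q)
    return (m11 + m00) / (m01 + m10 + m11 + m00)

def jaccard(p, q):
    """Jaccard coefficient: 1-1 matches over not-both-zero attributes."""
    m01, m10, m00, m11 = binary_counts(p, q)
    return m11 / (m01 + m10 + m11)

p = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
q = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
print(smc(p, q))      # 0.7
print(jaccard(p, q))  # 0.0
```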

Page 19: Similarity and Dissimilarity

Cosine Similarity

• This similarity measure is commonly used for documents.

• Documents are often represented as vectors, where each attribute represents the frequency of a particular word.

• Each document may have thousands or tens of thousands of attributes.

• Cosine similarity is one of the most common measures of document similarity.

Page 20: Similarity and Dissimilarity

Cosine Similarity

• If d1 and d2 are two document vectors, then

cos(d1, d2) = (d1 . d2) / (||d1|| ||d2||)

where . indicates the vector dot product and ||d|| is the length of vector d.

• Example: consider two document vectors
d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)

d1 . d2 = 3*1 + 2*0 + 0*0 + 5*0 + 0*0 + 0*0 + 0*0 + 2*1 + 0*0 + 0*2 = 5

||d1|| = (3*3 + 2*2 + 0*0 + 5*5 + 0*0 + 0*0 + 0*0 + 2*2 + 0*0 + 0*0)^0.5 = (42)^0.5 = 6.481

||d2|| = (1*1 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 0*0 + 1*1 + 0*0 + 2*2)^0.5 = (6)^0.5 = 2.449

cos(d1, d2) = 5 / (6.481 * 2.449) = 0.3150
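The cosine computation above, as a Python sketch (the helper name `cosine` is mine):

```python
from math import sqrt

def cosine(d1, d2):
    """cos(d1, d2) = (d1 . d2) / (||d1|| ||d2||)."""
    dot = sum(a * b for a, b in zip(d1, d2))
    norm1 = sqrt(sum(a * a for a in d1))
    norm2 = sqrt(sum(b * b for b in d2))
    return dot / (norm1 * norm2)

d1 = (3, 2, 0, 5, 0, 0, 0, 2, 0, 0)
d2 = (1, 0, 0, 0, 0, 0, 0, 1, 0, 2)
print(round(cosine(d1, d2), 4))  # 0.315
```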

Page 21: Similarity and Dissimilarity

• If the cosine similarity is 1, the angle between x and y is 0 and x and y are the same except for magnitude; if the cosine similarity is 0, the angle is 90 degrees and (for nonnegative frequency vectors) the documents share no terms.

[Figure: vectors x and y separated by angle θ.]

Page 22: Similarity and Dissimilarity

Extended Jaccard Coefficient (Tanimoto)

• Can be used for document data.

• Reduces to the Jaccard coefficient for binary attributes.
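One common statement of the extended Jaccard (Tanimoto) coefficient is EJ(p, q) = (p . q) / (||p||^2 + ||q||^2 - p . q); a minimal Python sketch (the helper name `tanimoto` is mine):

```python
def tanimoto(p, q):
    """Extended Jaccard: p.q / (||p||^2 + ||q||^2 - p.q)."""
    dot = sum(a * b for a, b in zip(p, q))
    return dot / (sum(a * a for a in p) + sum(b * b for b in q) - dot)

# For binary vectors this reduces to the Jaccard coefficient:
p = (1, 0, 1, 1)
q = (1, 1, 0, 1)
print(tanimoto(p, q))  # M11 / (M01 + M10 + M11) = 2 / 4 = 0.5
```

For binary vectors, p . q = M11, ||p||^2 = M11 + M10, and ||q||^2 = M11 + M01, so the expression reduces to M11 / (M01 + M10 + M11), which is the Jaccard coefficient.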

Page 23: Similarity and Dissimilarity

Correlation

• Correlation measures the linear relationship between objects.

• Example (perfect correlation): the following two sets of values for x and y are cases where the correlation is -1 and +1, respectively.

• x = (-3, 6, 0, 3, -6) and y = (1, -2, 0, -1, 2): here y = -x/3, so the correlation is -1. [Compute to verify.]
• x = (3, 6, 0, 3, 6) and y = (1, 2, 0, 1, 2): here y = x/3, so the correlation is +1. [Compute to verify.]

Page 24: Similarity and Dissimilarity

• To compute correlation, we standardize the data objects p and q and then take their dot product:

p'_k = (p_k - mean(p)) / std(p)
q'_k = (q_k - mean(q)) / std(q)

correlation(p, q) = p' . q'
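A Python sketch of the standardize-then-dot-product recipe (the function name `correlation` is mine; here std is the population standard deviation, so the dot product of the standardized vectors is averaged over the n attributes to give the Pearson correlation in [-1, 1]):

```python
from math import sqrt

def correlation(p, q):
    """Pearson correlation: standardize p and q, then average the products."""
    n = len(p)
    mp, mq = sum(p) / n, sum(q) / n
    sp = sqrt(sum((v - mp) ** 2 for v in p) / n)
    sq = sqrt(sum((v - mq) ** 2 for v in q) / n)
    return sum((a - mp) * (b - mq) for a, b in zip(p, q)) / (n * sp * sq)

# The two examples from the previous slide:
print(round(correlation((-3, 6, 0, 3, -6), (1, -2, 0, -1, 2)), 6))  # -1.0
print(round(correlation((3, 6, 0, 3, 6), (1, 2, 0, 1, 2)), 6))      # 1.0
```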

Page 25: Similarity and Dissimilarity

Visually Evaluating Correlation

• Scatter plots showing the similarity from -1 to 1.

• It is easy to visualize the correlation between two data objects x and y by plotting pairs of corresponding attribute values.

• The figure shows a number of such plots where x and y each have 30 attributes. Each circle in a plot represents one of the 30 attributes; its x-coordinate is the value of one of the attributes of x, while its y-coordinate is the value of the same attribute for y.

Page 26: Similarity and Dissimilarity

Issues in Computing Proximity Measures

1. Handling cases where attributes have different scales and/or are correlated.

2. Computing proximity between objects when they have different types of attributes (qualitative/quantitative).

3. Handling attributes with different weights, i.e., when all attributes do not contribute equally to the proximity of objects.

Page 27: Similarity and Dissimilarity

Mahalanobis Distance

mahalanobis(p, q) = (p - q) Sigma^{-1} (p - q)^T

where Sigma is the covariance matrix of the input data X.

For the red points in the figure, the Euclidean distance is 14.7 while the Mahalanobis distance is 6.

• The Mahalanobis distance is useful when attributes are correlated, have different ranges of values, and the distribution is approximately Gaussian.
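A sketch with NumPy (the helper name `mahalanobis` is mine; Sigma is estimated from the data X with np.cov, and the toy correlated data set is my own illustration, not the red points from the slide's figure):

```python
import numpy as np

def mahalanobis(p, q, X):
    """Mahalanobis distance between p and q: (p - q) Sigma^{-1} (p - q)^T,
    where Sigma is the covariance matrix of the data X."""
    diff = np.asarray(p, dtype=float) - np.asarray(q, dtype=float)
    sigma_inv = np.linalg.inv(np.cov(X, rowvar=False))
    return float(diff @ sigma_inv @ diff)

# Toy data: two strongly correlated attributes with different noise levels.
rng = np.random.default_rng(0)
a = rng.normal(size=200)
X = np.column_stack([a, a + 0.1 * rng.normal(size=200)])
print(mahalanobis(X[0], X[1], X))
```

Because Sigma^{-1} rescales along the principal axes of the data, differences in a direction where the data varies a lot count for less than equally large differences in a low-variance direction.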

Page 28: Similarity and Dissimilarity

Combining Similarities for Heterogeneous Attributes

• Sometimes attributes are of many different types, but an overall similarity is still needed.

Page 29: Similarity and Dissimilarity

Using Weights to Combine Similarities

• Some attributes may be more important for calculating proximity than others, so we may not want to treat all attributes the same.

• Use weights w_k that lie between 0 and 1 and sum to 1; the previous equation then becomes a weighted sum.

• The Minkowski distance becomes:

dist(p, q) = ( sum_{k=1}^{n} w_k |p_k - q_k|^r )^{1/r}

Page 30: Similarity and Dissimilarity

Review Questions (Unit 2)

1. What are the issues related to data quality? Discuss.
2. What is the difference between noise and an outlier?
3. Data quality issues can be considered from an application point of view. Discuss.
4. Write a short note on data preprocessing.
5. Discuss the difference between similarity and dissimilarity.
6. State the three axioms that make a distance a metric.
7. Discuss similarity measures for binary variables.
8. What is SMC?
9. How is cosine similarity used for measuring similarity between two document vectors?
10. Discuss issues related to the computation of proximity measures.
11. For the following vectors x and y, calculate the indicated similarity or distance measures:
(a) x = (1,1,1,1), y = (2,2,2,2): cosine, correlation, Euclidean.
(b) x = (0,1,0,1), y = (1,0,1,0): cosine, correlation, Euclidean, Jaccard, Hamming distance (for binary data, the L1 distance corresponds to the Hamming distance).