lecture02 03 similaritymetrices datavisualization
TRANSCRIPT
-
7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization
1/53
Distance or Similarity Measures
Many data mining and analytics tasks involve the comparison of objects and
determining in terms of their similarities (or dissimilarities)
Clustering
Nearest-neighbor search classification and prediction
Correlation analysis
Many of todays real-!orld applications rely on the computation similarities or
distances among objects
"ersonali#ation
$ecommender systems
%ocument categori#ation &nformation retrieval
-
7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization
2/53
Similarity and Dissimilarity
'imilarity
Numerical measure of ho! alike t!o data objects are alue is higher !hen objects are more alike
ften falls in the range *+,
%issimilarity (e.g. distance)
Numerical measure of ho! different t!o data objects are
/o!er !hen objects are more alike
Minimum dissimilarity is often +
0pper limit varies
"ro1imity refers to a similarity or dissimilarity
2
-
7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization
3/53
3
Distance or Similarity Measures
Measuring %istance
&n order to group similar items !e need a !ay to measure thedistance bet!een objects (e.g. records)
ften re2uires the representation of objects as 3feature vectors4
ID Gender Age Salary
1 F 27 19,000
2 M 51 64,000
3 M 52 100,000
4 F 33 55,000
5 M 45 45,000
T1 T2 T3 T4 T5 T6
Doc1 0 4 0 0 0 2
Doc2 3 1 4 3 1 2
Doc3 3 0 0 0 3 0
Doc4 0 1 0 3 0 0
Doc5 2 2 2 3 1 4
An Employee DB Term Frequencies for Documents
Feature vector corresponding to
Employee 2:
-
7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization
4/53
Distance or Similarity Measures
"roperties of %istance Measures5
for all objects 6 and 7 dist(6 7) + and dist(6 7) 8dist(7 6)
for any object 6 dist(6 6) 8 +
$epresentation of objects as vectors5
9ach data object (item) can be vie!ed as an n-dimensional vector !here
the dimensions are the attributes (features) in the data
:he vector representation allo!s us to compute distance or similarity
bet!een pairs of items using standard vector operations e.g. Cosine of the angle bet!een vectors
Manhattan distance
9uclidean distance
;amming %istance
-
7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization
5/53
Data Matrix and Distance Matrix
%ata matri1
Conceptual representation of a table Cols 8 features< ro!s 8 data objects
ndata points !ithpdimensions
9ach ro! in the matri1 is the vector
representation of a data object
%istance (or 'imilarity) Matri1
n data points but indicates only the
pair!ise distance (or similarity)
6 triangular matri1
'ymmetric
npx...
nfx...
n1x
...............
ipx...ifx...i1x
...............1px...1fx...11x
%&&&(2)(")
:::
(23)(
...ndnd
0dd(3,1
0d(2,1)
0
-
7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization
6/53
Proximity Measure for Nominal Attributes
If object attributes are all nominal (categorical), then
proximity measure are used to compare objects
Can take 2 or more states, e.g., red, yellow, blue, green(generaliation of a binary attribute)
!ethod "# $imple matching
m# % of matches,p# total % of &ariables
!ethod 2# Con&ert to $tandard $preadsheet format 'or each attributeA create Mbinary attribute for theMnominal states ofA
hen use standard &ectorbased similarity or distancemetrics
pmpjid
=)(
-
7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization
7/53
* + contingency table for binary data
* $imple matching coecient
dcbacbjid+++
+=)(
Proximity Measure for Binary Attributes
pdbcasum
dcdc
baba
sum
++
+
+
+
,
+,
*+,ect i
*+,ect j
-
7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization
8/53
Proximity Measure for Binary Attributes
* -xample
/et the &alues 0 and 1 be set to ", and the &alue be set to3
-!&%2""
2"()
#-&%"""
""()
33&%"%2
"%()
maryjimd
jimjackd
maryjackd
pdbcasum
dcdc
baba
sum
++
+
+
+
,
+,
cbacbjid++
+=)(
-
7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization
9/53.
Common Distance Measures for Numeric Data
Consider t!o vectors
$o!s in the data matri1
Common %istance Measures5
Manhattan distance5
9uclidean distance5
%istance can be defined as a dual of a similarity measure
( ) , ( )dist X Y sim X Y = = =( )
( )i i
i
i ii i
x ysim X Y
x y
=
-
7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization
10/53
Data Normalization for Numeric Data
* + database can contain nnumbers of numeric
attributes* + larger range attribute or noise can shift thesamples distances
* 'or example# 4Income5 attribute can dominate thedistance as compared to 46eight5 and 4+ge5
attributes
* he objecti&e of normaliation is con&ert allnumeric attributes, so that there &alues fall within asmall speci7ed range, such as 3 to ".3
* ormaliation is particularly useful for clusteringand distance based algorithms such as knearestneighbor
Attr1 Attr2 Attr3 Attr4
"2 28 99 9:
22 2: 92 9;
2; 28 98 9"2"2 29 9; 9;
-
7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization
11/53
* min-max normalization
-xample# $uppose that the minimumandmaximum&alues for the attribute income are"2,333 and y minmax normaliation, a&alue of ?9,;33for income is transformed to
@ ((?9,333"2,333)A(
-
7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization
12/53
-
7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization
13/53
Example: Data Matrix and Distance Matrix
Distance Matrix (Euclidean)
Data Matrix
Distance Matrix (Manhattan)
-
7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization
14/53
Vector-Based Similarity Measures
&n some situations distance measures provide a ske!ed vie! of data
9.g. !hen the data is very sparse and +@s in the vectors are notsignificant
&n such cases typically vector-based similarity measures are
used
Most common measure5 Cosine similarity
%ot product of t!o vectors5
Cosine 'imilarity 8 normali#ed dot product
the norm of a vector A is5
the cosine similarity is5
=i
ixX =
=
=
i
i
i
i
i
ii
yx
yx
yX
YXYXsim
==
)(
)(
, = nX x x x= / , = nY y y y= /
==i
ii yxYXYXsim )(
-
7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization
15/53
-
7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization
16/53
Example (Vector Space Model)
* G# gold sil&er truck
*H"#
$hipment of gold damaged in a 7re
* H2# Heli&ery of sil&er arri&ed in a sil&er truck
* H9# $hipment of gold arri&ed in a truck
he idf for the terms in the three documents is gi&en below#
-
7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization
17/5317
Example (Vector Space Model)
* Weight the items of ectors using tf !eighting
schemecell contain ra! term fre"uenc# of
term !ithin document
-
7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization
18/53
Documents # $uery in n-dimensional Space
"/
Documents are reresented as !ectors in t"e term sace Tyically !alues in eac" dimension corresond to t"e #re$uency o#
t"e corresonding term in t"e document
%ueries reresented as !ectors in t"e same !ector&sace
'osine similarity (et)een t"e $uery and documents is
o#ten used to ran* retrie!ed documents
-
7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization
19/53".
Example: Similarities amon% Documents
Consider the follo!ing document-term matri1
T1 T2 T3 T4 T5 T6 T7 T+
Doc1 0 4 0 0 0 2 1 3
Doc2 3 1 4 3 1 2 0 1
Doc3 3 0 0 0 3 0 3 0
Doc4 0 1 0 3 0 0 2 0
Doc5 2 2 2 3 1 4 0 2
Dot01roduct)Doc2Doc$(
-
7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization
20/53
!an& Correlation as Similarity
&n cases !here !e !ant to analy#e correlation among
t!o features or !e have high mean variance across
data objects (e.g. movies product ratings) $ank
Correlation coefficient is the best option
'pearman?s rank correlation coefficient Bendall tau rank correlation coefficient
ften used in recommender systems based on
Collaborative iltering
2%
-
7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization
21/53
!an& Correlation as Similarity
* 1roducts ating
Hell, +pple, $amsung, +cer, J1, $ony
* Kser" atings#
Hell(=A"3), +pple (8A"3), $amsung(
-
7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization
22/53
!an& Correlation as Similarity
* 6hich Kser is most to Kser"
* Kser" and Kser2
* Kser" and Kser9
Rank Correlation as Similarity
-
7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization
23/53
Rank Correlation as Similarity
(Example)
* 'ind Correlation between $tudentIG and Jouse of L per week
* diis the distance between ranksof two attributes
* n total samples
Rank Correlation as Similarity
-
7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization
24/53
Rank Correlation as Similarity
(Example)
-
7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization
25/53
D i Ti W i
-
7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization
26/53
Dynamic Time Warping!erndt" Cli##ord" $%%&'
* +llows accelerationdeceleration of signals along the
time dimension
* >asic idea
Consider M x", x2, N, xn, and 0 y", y2, N, yn
6e are allowed to extend each seOuence byrepeating elements
-uclidean distance now calculated between theextended seOuences Mand 0
!atrix !, where mij d(xi, yj)
-
7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization
27/53
Example
,uclidean distance !s DT-
-
7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization
28/53
o to Calc*late DTW
* $teps
Histance table calculation Calculating shortest path in the table
* -xample# $eries" P2.3, 9.3, ?.3, ?.3, =.3, =.3, ;.3, 2.3,8.3, 2.3, :.3, 8.3, 8.3 Q
$eries2 P:.3, ?.3, ?.3, =.3, 2.3, 2.3, ;.3, 8.3,
9.3, :.3, 8.3, 8.3 Q
-
7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization
29/53
o to Calc*late DTW
* Histance able Calculation
Distance Matrix comuting
dist.i/0 2 dist.si/ $0 3 min 4dist.i&1/0&1/ dist.i/ 0&1/ dist.i&1/0
-
7/24/2019 LECTURE02 03 SimilarityMetrices DataVisualization
30/53
Example $
s" s2 s9 s: s8 s; s? s= s