Post on 08-Jan-2016

Page 1: Distance Functions on Hierarchies

Distance Functions on Hierarchies

Eftychia Baikousi

Page 2: Distance Functions on Hierarchies

Outline

Definition of metric & similarity
Various Distance Functions
  Minkowski
  Set based
  Edit distance
Basic concept of OLAP
Lattice
Distance in same level of hierarchy
Distance in different level of hierarchy

Page 3: Distance Functions on Hierarchies

Definition of metric

A distance function on a given set M is a function $d: M \times M \to \mathbb{R}$ that satisfies the following conditions:

$d(x,y) \ge 0$, and $d(x,y) = 0$ iff $x = y$
  The distance between two different points is positive, and it is zero precisely from a point to itself

It is symmetric: $d(x,y) = d(y,x)$
  The distance between x and y is the same in either direction

It satisfies the triangle inequality: $d(x,z) \le d(x,y) + d(y,z)$
  The distance between two points is the shortest distance along any path

A function d that satisfies all of these conditions is a metric

Page 4: Distance Functions on Hierarchies

Definition of similarity metric

Let s(x,y) be the similarity between two points x and y. Then the following properties hold:

$s(x,y) = 1$ only if $x = y$, with $0 \le s \le 1$

$s(x,y) = s(y,x)$ for all x and y (symmetry)

The triangle inequality need not hold

Page 6: Distance Functions on Hierarchies

Minkowski Family

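norm-1, City-Block, Manhattan: $L_1(x,y) = \sum_i |x_i - y_i|$

norm-2, Euclidean: $L_2(x,y) = \left(\sum_i |x_i - y_i|^2\right)^{1/2}$

norm-p, Minkowski: $L_p(x,y) = \left(\sum_i |x_i - y_i|^p\right)^{1/p}$

infinity norm: $L_\infty(x,y) = \lim_{p \to \infty} \left(\sum_i |x_i - y_i|^p\right)^{1/p} = \max_i |x_i - y_i|$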
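The whole Minkowski family can be computed by one small function. The following is a minimal sketch, not part of the original slides; the function name `minkowski` and the use of `float("inf")` to select the infinity norm are illustrative choices.

```python
import math


def minkowski(x, y, p):
    """Minkowski distance L_p between two equal-length numeric sequences.

    p = 1 gives the Manhattan distance, p = 2 the Euclidean distance,
    and p = float("inf") the infinity norm (maximum coordinate difference).
    """
    if len(x) != len(y):
        raise ValueError("x and y must have the same number of coordinates")
    diffs = [abs(a - b) for a, b in zip(x, y)]
    if math.isinf(p):
        return max(diffs)
    return sum(d ** p for d in diffs) ** (1.0 / p)


print(minkowski([0, 0], [3, 4], 1))             # Manhattan
print(minkowski([0, 0], [3, 4], 2))             # Euclidean
print(minkowski([0, 0], [3, 4], float("inf")))  # infinity norm
```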

Page 7: Distance Functions on Hierarchies

Set Based

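Simple matching coefficient: $SMC = \dfrac{\#\,matching\ attribute\ values}{\#\,attributes}$

Jaccard Coefficient (in distance form): $J(A,B) = 1 - \dfrac{|A \cap B|}{|A \cup B|}$

Extended Jaccard, Tanimoto (Vector based): $T(A,B) = \dfrac{A \cdot B}{\|A\|^2 + \|B\|^2 - A \cdot B}$

Cosine (Vector based): $\cos(x,y) = \dfrac{x \cdot y}{\|x\|\,\|y\|}$

Dice's coefficient: $s = \dfrac{2\,|X \cap Y|}{|X| + |Y|}$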
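The set-based and vector-based coefficients above are short enough to write out directly. The sketch below is illustrative only: the function names are hypothetical, `jaccard` returns the coefficient $|A \cap B| / |A \cup B|$ (one minus it gives the distance form), and returning 1.0 for two empty sets is an assumed convention.

```python
import math


def smc(a, b):
    """Simple matching coefficient: share of attribute positions with equal values."""
    if len(a) != len(b) or len(a) == 0:
        raise ValueError("a and b must be non-empty and of equal length")
    return sum(u == v for u, v in zip(a, b)) / len(a)


def jaccard(a, b):
    """Jaccard coefficient |A n B| / |A u B|; 1 - jaccard(a, b) is the distance form."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # assumed convention for two empty sets
    return len(a & b) / len(a | b)


def dice(x, y):
    """Dice's coefficient 2|X n Y| / (|X| + |Y|)."""
    x, y = set(x), set(y)
    if not x and not y:
        return 1.0  # assumed convention for two empty sets
    return 2 * len(x & y) / (len(x) + len(y))


def _dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def tanimoto(x, y):
    """Extended Jaccard (Tanimoto) coefficient for numeric vectors."""
    d = _dot(x, y)
    return d / (_dot(x, x) + _dot(y, y) - d)


def cosine(x, y):
    """Cosine of the angle between two non-zero numeric vectors."""
    return _dot(x, y) / (math.sqrt(_dot(x, x)) * math.sqrt(_dot(y, y)))


print(jaccard({1, 2, 3}, {2, 3, 4}), dice({1, 2, 3}, {2, 3, 4}))
```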

Page 8: Distance Functions on Hierarchies

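Edit Distance: Levenshtein distance

The edit distance between two strings $x = x_1 \ldots x_n$ and $y = y_1 \ldots y_m$ is defined as the minimum number of atomic edit operations needed to transform x into y

Insert: $ins(x,i,c) = x_1 x_2 \ldots x_i\, c\, x_{i+1} \ldots x_n$

Delete: $del(x,i) = x_1 x_2 \ldots x_{i-1} x_{i+1} \ldots x_n$

Replace: $rep(x,i,c) = x_1 x_2 \ldots x_{i-1}\, c\, x_{i+1} \ldots x_n$

Every edit operation o is assigned the cost $c(o) = 1$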
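With unit costs for insert, delete and replace, the minimum number of operations is usually found by dynamic programming over prefixes of the two strings. The sketch below shows that standard technique; it is not taken from the slides, and the function name `levenshtein` is an illustrative choice.

```python
def levenshtein(x, y):
    """Minimum number of unit-cost insert/delete/replace operations turning x into y."""
    # prev[j] holds the distance between the processed prefix of x and y[:j].
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]  # turning x[:i] into the empty string takes i deletions
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # delete cx
                curr[j - 1] + 1,           # insert cy
                prev[j - 1] + (cx != cy),  # replace cx by cy (free if equal)
            ))
        prev = curr
    return prev[-1]


print(levenshtein("MARTHA", "MARHTA"))
```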

Page 9: Distance Functions on Hierarchies

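Edit distances: Needleman-Wunsch distance or Sellers algorithm

Insert
  a character: $ins(x,i,c) = x_1 x_2 \ldots x_i\, c\, x_{i+1} \ldots x_n$ with cost(o) = 1
  a gap: $ins\_g(x,i,g) = x_1 x_2 \ldots x_i\, g\, x_{i+1} \ldots x_n$ with cost(o) = g

Delete
  a character: $del(x,i) = x_1 x_2 \ldots x_{i-1} x_{i+1} \ldots x_n$ with cost(o) = 1
  a gap: $del\_g(x,i) = x_1 x_2 \ldots x_{i-1} x_{i+1} \ldots x_n$ with cost(o) = g

Replace
  a character: $rep(x,i,c) = x_1 x_2 \ldots x_{i-1}\, c\, x_{i+1} \ldots x_n$ with cost(o) = 1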

Page 10: Distance Functions on Hierarchies

Edit distances

Jaro distance

Let s and t be two strings, and let
  s' = the characters in s that are common with t
  t' = the characters in t that are common with s
  $T_{s',t'}$ = the number of transpositions of characters in s' relative to t'

$$Jaro(s,t) = \frac{1}{3}\left(\frac{|s'|}{|s|} + \frac{|t'|}{|t|} + \frac{|s'| - T_{s',t'}}{2|s'|}\right)$$

Page 11: Distance Functions on Hierarchies

Edit distances

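Jaro distance: example

Let s = MARTHA and t = MARHTA

|s'| = 6, |t'| = 6 and $T_{s',t'} = 2/2 = 1$, since the mismatched characters are T/H and H/T

$$Jaro(s,t) = \frac{1}{3}\left(\frac{|s'|}{|s|} + \frac{|t'|}{|t|} + \frac{|s'| - T_{s',t'}}{2|s'|}\right) = \frac{1}{3}\left(\frac{6}{6} + \frac{6}{6} + \frac{6-1}{12}\right) = 0.8055$$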
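The example can be checked with a short sketch. Two things here are assumptions rather than content of the slides: "common" characters are found with the usual Jaro matching window of max(|s|, |t|)/2 - 1 positions, and the hypothetical parameter `last_term_factor` selects the denominator of the last term. The value 2 follows the formula as written above; the form of Jaro most often cited divides the last term by |s'| instead (factor 1), which gives a different score for the same pair.

```python
def jaro(s, t, last_term_factor=2):
    """Jaro score with T = half the number of out-of-order common characters.

    last_term_factor=2 follows the formula on the slide; last_term_factor=1
    is the commonly cited variant (an addition, not from the slides).
    """
    if not s or not t:
        return 0.0
    # Assumed matching window: characters count as common only if they
    # are at most this many positions apart.
    window = max(max(len(s), len(t)) // 2 - 1, 0)
    s_used = [False] * len(s)
    t_used = [False] * len(t)
    for i, ch in enumerate(s):
        for j in range(max(0, i - window), min(len(t), i + window + 1)):
            if not t_used[j] and t[j] == ch:
                s_used[i] = t_used[j] = True
                break
    s_common = [c for c, used in zip(s, s_used) if used]  # s'
    t_common = [c for c, used in zip(t, t_used) if used]  # t'
    m = len(s_common)
    if m == 0:
        return 0.0
    transpositions = sum(a != b for a, b in zip(s_common, t_common)) / 2
    last = (m - transpositions) / (last_term_factor * m)
    return (m / len(s) + m / len(t) + last) / 3


print(round(jaro("MARTHA", "MARHTA"), 4))                      # slide formula
print(round(jaro("MARTHA", "MARHTA", last_term_factor=1), 4))  # common variant
```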

Page 12: Distance Functions on Hierarchies

Edit distances

Jaro-Winkler

JWS(s,t) = Jaro(s,t) + (prefixLength * PREFIXSCALE * (1.0 - Jaro(s,t)))

where:
  prefixLength: the length of the common prefix at the start of the two strings
  PREFIXSCALE: a constant scaling factor that gives more favourable ratings to strings that match from the beginning, up to a set prefix length

Page 13: Distance Functions on Hierarchies

Edit distances

Jaro-Winkler: example

Let s = MARTHA and t = MARHTA, with PREFIXSCALE = 0.1

Jaro(s,t) = 0.8055 and prefixLength = 3

JWS(s,t) = Jaro(s,t) + (prefixLength * PREFIXSCALE * (1.0 - Jaro(s,t)))
         = 0.8055 + (3 * 0.1 * (1 - 0.8055)) = 0.86385
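The arithmetic of the example can be reproduced with a few lines. This is a sketch under stated assumptions: the Jaro score is passed in as a number rather than recomputed, and the cap of 4 on the prefix length (`max_prefix`) is the value commonly used with Jaro-Winkler, not one given on the slides.

```python
def common_prefix_length(s, t, max_prefix=4):
    """Length of the shared prefix of s and t, capped at max_prefix (assumed cap)."""
    n = 0
    for a, b in zip(s, t):
        if a != b or n == max_prefix:
            break
        n += 1
    return n


def jaro_winkler(s, t, jaro_value, prefix_scale=0.1, max_prefix=4):
    """JWS(s,t) = Jaro + prefixLength * PREFIXSCALE * (1 - Jaro)."""
    prefix_length = common_prefix_length(s, t, max_prefix)
    return jaro_value + prefix_length * prefix_scale * (1.0 - jaro_value)


# MARTHA and MARHTA share the prefix "MAR", so prefixLength = 3.
print(jaro_winkler("MARTHA", "MARHTA", 0.8055))
```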

Page 15: Distance Functions on Hierarchies

Basic Concepts of OLAP

OLAP concerns the analysis of measurable quantities (measures): sales, stock, profit, ...

Dimensions: parameters that determine the context of the measures: date, product, location, salesperson, ...

Cubes: combinations of dimensions that determine some measures. A cube defines a multidimensional space of dimensions, and the measures are points of this space

Page 16: Distance Functions on Hierarchies

Cubes for OLAP

[Figure: a cube with the dimensions REGION (N, S, W), PRODUCT (Juice, Cola, Soap) and MONTH (Jan); the cells hold measure values such as 10 and 13.]

Page 17: Distance Functions on Hierarchies

Cubes for OLAP

Page 18: Distance Functions on Hierarchies

Basic Concepts of OLAP

The data are considered to be stored in a multi-dimensional array, which is also called a cube or hypercube.

The cube is a group of data cells. Each cell is uniquely identified by the corresponding values of the dimensions of the cube.

The contents of a cell are called measures and represent the assessed values of the real world.

Page 19: Distance Functions on Hierarchies

Hierarchies of levels for OLAP

A dimension models all the ways in which the data can be aggregated with respect to a specific parameter of their context. Date, Product, Location, Salesperson, ...

Each dimension has an associated hierarchy of aggregation levels for the data (hierarchy of levels). This means that the dimension can be viewed at several levels of granularity. Date: day, week, month, year, ...

Page 20: Distance Functions on Hierarchies

Hierarchies of Levels

Hierarchies of levels: each dimension is organized into different levels of granularity

The user can navigate from one level to another, creating new cubes each time

Coarseness (the correct Greek term is "αδρομέρεια"): the opposite of detail

[Figure: the hierarchy of the Date dimension, with Day at the bottom, Month and Week above it, and Year at the top.]
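Navigating from a detailed level to a coarser one amounts to mapping each value to its ancestor and aggregating the measures that end up in the same cell. The sketch below illustrates this for Day to Month; the cube contents, the date strings and the helper names are hypothetical and only serve the illustration.

```python
from collections import defaultdict

# Hypothetical day-level cube: (day, product) -> sales volume.
day_cube = {
    ("2016-01-07", "Cola"): 10,
    ("2016-01-08", "Cola"): 13,
    ("2016-02-01", "Cola"): 5,
    ("2016-01-07", "Juice"): 4,
}


def day_to_month(day):
    """Ancestor of a Day value at the Month level (assumes YYYY-MM-DD strings)."""
    return day[:7]


def roll_up(cube, ancestor):
    """Build a coarser cube by summing the measures of cells that share an ancestor."""
    coarser = defaultdict(int)
    for (date_value, product), sales in cube.items():
        coarser[(ancestor(date_value), product)] += sales
    return dict(coarser)


month_cube = roll_up(day_cube, day_to_month)
print(month_cube)
```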

Page 21: Distance Functions on Hierarchies

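Cubes & dimension hierarchies for OLAP

Dimensions: Product, Region, Date

Dimension hierarchies (coarsest level first):
  Product: Industry, Category, Product
  Region: Country, Region, City, Store
  Date: Year, Quarter, Month and Week, Day

[Figure: a cube with the axes Month, Region and Product; each cell holds a sales volume.]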

Page 23: Distance Functions on Hierarchies

Lattice

A lattice is a partially ordered set (poset) in which every pair of elements has a unique supremum and a unique infimum

The hierarchy of levels is formally defined as a lattice $(L, <)$ such that $L = (L_1, \ldots, L_n, ALL)$ is a finite set of levels and $<$ is a partial order defined among the levels of L such that $L_1 < L_i < ALL$ for every $1 < i \le n$

The upper bound of the lattice is always the level ALL, so that we can group all values into the single value 'all'

The lower bound of the lattice is the most detailed level of the dimension

Page 25: Distance Functions on Hierarchies

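Distances in the same level of Hierarchy

Let D be a dimension with levels of hierarchy $L_1 < L_i < ALL$, and let x and y be two specific values such that $x, y \in L_i$

[Figure: the levels of the hierarchy, from L1 at the bottom through L2 up to ALL.]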

Page 26: Distance Functions on Hierarchies

Distances in the same level of Hierarchy

Explicit
Minkowski
Set Based
Highway
With respect to the detailed level
Attribute Based

Page 27: Distance Functions on Hierarchies

Distances in the same level of Hierarchy

Explicit assignment: $n^2$ distances for the n values of $dom(L_i)$

Minkowski family: reduces to the Manhattan distance $|x - y|$

Set based family: reduces to $\{0, 1\}$, where

$$dist(x,y) = \begin{cases} 0, & \text{if } x = y \\ 1, & \text{if } x \ne y \end{cases}$$
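The three same-level options can be sketched side by side. Everything named here is hypothetical: the helper `explicit_table` fills the $n^2$ entries from one value per unordered pair, which assumes the assigned distances are symmetric and zero on the diagonal, and the example values are invented for illustration.

```python
def explicit_table(values, pair_distances):
    """Build the n^2 lookup table from one assigned distance per unordered pair."""
    table = {(v, v): 0 for v in values}
    for (a, b), d in pair_distances.items():
        table[(a, b)] = table[(b, a)] = d  # assumed symmetric
    return table


def manhattan_1d(x, y):
    """Minkowski family on single numeric values: every L_p reduces to |x - y|."""
    return abs(x - y)


def discrete(x, y):
    """Set-based family on single values: 0 if equal, 1 otherwise."""
    return 0 if x == y else 1


sizes = ["small", "medium", "large"]  # hypothetical values of one level
table = explicit_table(sizes, {
    ("small", "medium"): 1,
    ("medium", "large"): 1,
    ("small", "large"): 3,
})
print(len(table), table[("large", "small")])
print(manhattan_1d(2016, 2013), discrete("Cola", "Juice"))
```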

Page 28: Distance Functions on Hierarchies

Distances in the same level of Hierarchy

Highway distance

Let the values of level $L_i$ form a set of k clusters, where each cluster has a representative $r_k$

$dist(x, y) = dist(x, r_x) + dist(r_x, r_y) + dist(y, r_y)$

Specify
  $k^2$ distances: $dist(r_x, r_y)$
  k distances: $dist(x, r_x)$
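A highway distance needs only the cluster assignment, the distance of each value to its representative and the distances between representatives. The sketch below uses invented city data; the early return for x == y is an addition that keeps d(x, x) = 0, which the formula alone would not guarantee.

```python
def highway_distance(x, y, rep_of, dist_to_rep, dist_between_reps):
    """dist(x, y) = dist(x, r_x) + dist(r_x, r_y) + dist(y, r_y)."""
    if x == y:
        return 0  # added so that d(x, x) = 0; not part of the formula itself
    r_x, r_y = rep_of[x], rep_of[y]
    # Representatives of the same cluster coincide, so that leg costs nothing.
    between = 0 if r_x == r_y else dist_between_reps[frozenset((r_x, r_y))]
    return dist_to_rep[x] + between + dist_to_rep[y]


# Hypothetical clustering of a City level into two clusters.
rep_of = {"Athens": "Athens", "Patras": "Athens", "Rome": "Rome", "Milan": "Rome"}
dist_to_rep = {"Athens": 0, "Patras": 2, "Rome": 0, "Milan": 5}
dist_between_reps = {frozenset(("Athens", "Rome")): 10}

print(highway_distance("Patras", "Milan", rep_of, dist_to_rep, dist_between_reps))
```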

Page 29: Distance Functions on Hierarchies

Distances in the same level of Hierarchy

With respect to the detailed level

$dist(x,y) = f\left(desc^{L_x}_{L_1}(x),\ desc^{L_y}_{L_1}(y)\right)$

f is a function that picks one of the descendants

Attribute based

Level L has attributes $a^{L}_{1}, a^{L}_{2}, \ldots, a^{L}_{n}$, so that each value $v \in dom(L)$ is described by $[v_1 \ldots v_n]$

Distance can be defined with respect to the attributes

Page 31: Distance Functions on Hierarchies

Distances in different levels of Hierarchy

Explicit
  dist1 + dist2
  dist3 + dist4
With respect to the detailed level
With respect to their least common ancestor
Highway
Attribute Based

Page 32: Distance Functions on Hierarchies

Distances in different levels of Hierarchy

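Let D be a dimension with levels of hierarchy $L_1 < L_i < ALL$, and let x and y be two specific values such that

$x \in L_x$, $y \in L_y$ and $L_x < L_y$

$x_y = anc^{L_y}_{L_x}(x)$: the ancestor of x in level $L_y$

$y_x = desc^{L_y}_{L_x}(y)$: a descendant of y in level $L_x$

[Figure: x and $y_x$ lie in level $L_x$, $x_y$ and y lie in level $L_y$; the connections between them are labelled dist1, dist2, dist3 and dist4.]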

Page 33: Distance Functions on Hierarchies

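Distances in different levels of Hierarchy

Explicit assignment: define $dist_{L_x,L_y}(x, y)$ for every $x \in L_x$, $y \in L_y$

dist1 + dist2

$$dist_1 + dist_2 = dist\left(x,\ anc^{L_y}_{L_x}(x)\right) + dist\left(anc^{L_y}_{L_x}(x),\ y\right)$$

where $dist\left(anc^{L_y}_{L_x}(x),\ y\right)$ is a distance between two values from the same level of hierarchy

Special case: if y is an ancestor of x, then dist2 = 0

[Figure: x and $y_x$ in level $L_x$, $x_y$ and y in level $L_y$, with the connections dist1, dist2, dist3 and dist4.]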

Page 34: Distance Functions on Hierarchies

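Distances in different levels of Hierarchy

dist3 + dist4

$$dist_3 + dist_4 = dist\left(y,\ desc^{L_y}_{L_x}(y)\right) + dist\left(desc^{L_y}_{L_x}(y),\ x\right)$$

where $dist\left(desc^{L_y}_{L_x}(y),\ x\right)$ is a distance between two values from the same level of hierarchy

Special case: if y is an ancestor of x, then dist4 = 0

[Figure: x and $y_x$ in level $L_x$, $x_y$ and y in level $L_y$, with the connections dist1, dist2, dist3 and dist4.]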
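The two routes, up through the ancestor of x or down through a descendant of y, can be sketched on a toy hierarchy. All specifics here are assumptions for illustration: the City < Country data, a fixed cost of 1 for crossing one level (`level_gap`), the 0/1 distance used within a level, and a `desc` that prefers x itself when x is a descendant of y so that the special case dist4 = 0 holds.

```python
# Hypothetical two-level Location hierarchy: City < Country.
PARENT = {"Athens": "Greece", "Patras": "Greece", "Rome": "Italy"}


def anc(x):
    """Ancestor of a City value at the Country level."""
    return PARENT[x]


def desc(y, prefer=None):
    """Pick one descendant of y; prefer the given value if it descends from y."""
    children = sorted(c for c, p in PARENT.items() if p == y)
    return prefer if prefer in children else children[0]


def discrete(a, b):
    """Assumed same-level distance: 0 if equal, 1 otherwise."""
    return 0 if a == b else 1


def level_gap(lower, upper):
    """Assumed fixed cost of moving between a value and its ancestor."""
    return 1


def dist_via_ancestor(x, y):
    x_y = anc(x)
    return level_gap(x, x_y) + discrete(x_y, y)  # dist1 + dist2


def dist_via_descendant(x, y):
    y_x = desc(y, prefer=x)
    return level_gap(y_x, y) + discrete(y_x, x)  # dist3 + dist4


print(dist_via_ancestor("Patras", "Greece"), dist_via_ancestor("Patras", "Italy"))
print(dist_via_descendant("Patras", "Greece"), dist_via_descendant("Patras", "Italy"))
```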

Page 35: Distance Functions on Hierarchies

Distances in different levels of Hierarchy

With respect to the detailed level

Let $x_1 = f\left(desc^{L_x}_{L_1}(x)\right)$ and $y_1 = f\left(desc^{L_y}_{L_1}(y)\right)$

$$dist(x,y) = dist(x, x_1) + dist(x_1, y_1) + dist(y_1, y)$$

where $dist(x_1, y_1)$ is a distance between two values from the same level of hierarchy

Page 36: Distance Functions on Hierarchies

Distances in different levels of Hierarchy

With respect to their common ancestor

Let $L_z$ be the level of hierarchy where x and y have their first common ancestor

$$dist(x,y) = dist\left(x,\ anc^{L_z}_{L_x}(x)\right) + dist\left(anc^{L_z}_{L_y}(y),\ y\right)$$

number of "hops" needed to reach the first common ancestor

normalizing according to the height of the level
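Counting hops to the first common ancestor is straightforward when every value stores a pointer to its parent. The sketch below covers only the hop count, without the normalization step; the `parent` map and its values are hypothetical.

```python
def path_to_root(value, parent):
    """The value followed by its ancestors, from nearest to farthest."""
    path = [value]
    while value in parent:
        value = parent[value]
        path.append(value)
    return path


def common_ancestor_distance(x, y, parent):
    """Hops from x plus hops from y to their first common ancestor."""
    hops_from_x = {node: hops for hops, node in enumerate(path_to_root(x, parent))}
    for hops_from_y, node in enumerate(path_to_root(y, parent)):
        if node in hops_from_x:
            return hops_from_x[node] + hops_from_y
    raise ValueError("x and y have no common ancestor")


# Hypothetical Location hierarchy: City < Country < ALL.
parent = {
    "Athens": "Greece", "Patras": "Greece", "Rome": "Italy",
    "Greece": "ALL", "Italy": "ALL",
}

print(common_ancestor_distance("Athens", "Patras", parent))  # same level
print(common_ancestor_distance("Athens", "Italy", parent))   # different levels
```

When y is itself an ancestor of x, the first common ancestor is y, so the second term is zero and the distance is just the number of hops from x up to y.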

Page 37: Distance Functions on Hierarchies

Distances in different levels of Hierarchy

Highway distance

Let every level $L_i$ be clustered into $k_i$ clusters, where every cluster has its own representative $r_{k_i}$

$$dist(x,y) = dist(x, r_x) + dist(r_x, r_y) + dist(r_y, y)$$

Attribute Based

Level L attributes: each value $v \in dom(L)$ is described by $[v_1 \ldots v_n]$

Distance can be defined with respect to the attributes

Page 38: Distance Functions on Hierarchies

Types of Levels

Nominal (=): values hold the distinctness property
  values can be explicitly distinguished

Ordinal (<, >): values hold the distinctness property & the order property
  values abide by an order

Interval (+, -): values hold the distinctness, order & addition properties
  a unit of measurement exists
  the difference between two values is meaningful