
DISSIMILARITIES AND MATCHING BETWEEN SYMBOLIC OBJECTS

Prof. Donato Malerba
Department of Informatics, University of Bari, Italy
malerba@di.uniba.it

ASSO School
Athens, Greece
October 6-8, 2003


COMPUTING DISSIMILARITIES: WHY?

• Several data analysis techniques are based on quantifying a dissimilarity (or similarity) measure between multivariate data:

– Clustering

– Discriminant analysis

– Visualization-based approaches

• Symbolic objects are a kind of multivariate data.

• Ex.: [colour = {red, black}] ∧ [weight ∈ {60, 70, 80}] ∧ [height ∈ [1.50, 1.60]]

• The dissimilarity measures presented here are among those investigated in the ASSO Project.


A case study

• Abalone features survey

• Abalones are members of a large class (Gastropoda) of molluscs having one-piece shells.

• 4177 cases of abalones (marine molluscs) described by the following attributes:

| Attribute name | Data type | Unit of meas. | Description |
|---|---|---|---|
| Sex | Nominal | | M, F, I (infant) |
| Length | Continuous | mm | Longest shell measurement |
| Diameter | Continuous | mm | Perpendicular to length |
| Height | Continuous | mm | Measured with meat in shell |
| Whole weight | Continuous | grams | Weight of the whole abalone |
| Shucked weight | Continuous | grams | Weight of the meat |
| Viscera weight | Continuous | grams | Gut weight after bleeding |
| Shell weight | Continuous | grams | Weight of the dried shell |
| Rings | Integer | | Number of rings |


The construction of SO

DB2SO: facility of the ASSO system to generate (Boolean or Probabilistic) symbolic objects from relational databases.

Input:

• a set of groups or classes $C_1, C_2, \dots, C_K$

• a set of $n$ individuals $\omega_k$, each of which is described by $p$ variables $Y_1, \dots, Y_p$ and is assigned to one or more groups

Output:

• a set of $K$ symbolic objects $e_i$, described by the $p$ variables $Y_1, \dots, Y_p$

Example: nine symbolic objects, one for each interval of the number of rings.
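The generalization step can be illustrated with a minimal Python sketch. It is not the DB2SO implementation: the function names, the dictionary representation of a symbolic object and the toy records are all assumptions made for illustration. Each group is summarised by the set of observed values for a nominal variable and by the [min, max] interval for a continuous one.

```python
from collections import defaultdict

def build_bsos(individuals, group_of, nominal, continuous):
    """Generalize individuals into one Boolean symbolic object per group."""
    groups = defaultdict(list)
    for row in individuals:
        groups[group_of(row)].append(row)
    bsos = {}
    for label, rows in groups.items():
        so = {}
        for var in nominal:
            # Nominal variable: the set of values observed in the group.
            so[var] = {r[var] for r in rows}
        for var in continuous:
            # Continuous variable: the interval spanned by the group.
            values = [r[var] for r in rows]
            so[var] = (min(values), max(values))
        bsos[label] = so
    return bsos

# Toy records (invented values, not taken from the Abalone data set).
abalones = [
    {"sex": "M", "length": 0.45, "rings": 8},
    {"sex": "F", "length": 0.53, "rings": 9},
    {"sex": "I", "length": 0.21, "rings": 2},
    {"sex": "I", "length": 0.25, "rings": 3},
]

def ring_interval(row):
    """Hypothetical grouping rule: intervals of three rings (1-3, 4-6, ...)."""
    lo = 3 * ((row["rings"] - 1) // 3) + 1
    return f"{lo}-{lo + 2}"

bsos = build_bsos(abalones, ring_interval, nominal=["sex"], continuous=["length"])
print(bsos["7-9"])
```

The grouping rule is a simplification: it uses fixed-width intervals throughout and does not reproduce a wider last interval.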


TABLE OF BOOLEAN SYMBOLIC OBJECTS


COMPUTATION OF DISSIMILARITIES BETWEEN SYMBOLIC OBJECTS

Dissimilarity matrix

| SO | 1 | 2 | 3 | 4 |
|---|---|---|---|---|
| 1 | 0.0000 | | | |
| 2 | 0.2053 | 0.0000 | | |
| 3 | 12.8626 | 15.0793 | 0.0000 | |
| 4 | 14.0338 | 15.0403 | 8.6463 | 0.0000 |
| … | | | | |


The MID property

The degree of dissimilarity between abalones computed on the independent attributes should be proportional to the dissimilarity in the dependent attribute (i.e., the difference in the number of rings). This property is called monotonic increasing dissimilarity (MID).
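One simple way to test the property on a computed matrix is sketched below in Python. It assumes a particular reading of MID, namely that the dissimilarity from the first symbolic object never decreases as the ring interval moves further away; the function name is hypothetical. The sample values are the first column of the dissimilarity matrix shown earlier.

```python
def satisfies_mid(dissimilarities_from_first):
    """True if the dissimilarities never decrease along the ordered groups."""
    pairs = zip(dissimilarities_from_first, dissimilarities_from_first[1:])
    return all(earlier <= later for earlier, later in pairs)

# d(1,1), d(2,1), d(3,1), d(4,1) from the dissimilarity matrix.
first_column = [0.0000, 0.2053, 12.8626, 14.0338]
print(satisfies_mid(first_column))
```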


[Figure: ten panels titled Abalone - U_1, U_2, U_3, U_4, SO_1, SO_2, SO_3, SO_4, SO_5 and C_1. Each panel plots the dissimilarity values (vertical axis) against the ring intervals 1-3, 4-6, 7-9, 10-12, 13-15, 16-18, 19-21, 22-24 and 25-29 (horizontal axis).]


BOOLEAN SYMBOLIC OBJECTS (BSO’S)

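A BSO is a conjunction of Boolean elementary events:

$$[Y_1 = A_1] \wedge [Y_2 = A_2] \wedge \dots \wedge [Y_p = A_p]$$

where each variable $Y_i$ takes values in its domain $\mathcal{Y}_i$ and $A_i$ is a subset of $\mathcal{Y}_i$.

Let $a$ and $b$ be two BSO’s:

$$a = [Y_1 = A_1] \wedge [Y_2 = A_2] \wedge \dots \wedge [Y_p = A_p]$$

$$b = [Y_1 = B_1] \wedge [Y_2 = B_2] \wedge \dots \wedge [Y_p = B_p]$$

where each variable $Y_j$ takes values in $\mathcal{Y}_j$, and $A_j$ and $B_j$ are subsets of $\mathcal{Y}_j$. We are interested in computing the dissimilarity $d(a, b)$.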


CONSTRAINED BSO’S

Two types of dependencies between variables:

• Hierarchical dependence (mother-daughter): a variable $Y_i$ may be inapplicable if another variable $Y_j$ takes its values in a subset $S_j \subseteq \mathcal{Y}_j$. This dependence is expressed as a rule:

if $[Y_j = S_j]$ then $[Y_i = NA]$

• Logical dependence: this case occurs if a subset $S_j \subseteq \mathcal{Y}_j$ of a variable $Y_j$ is related to a subset $S_i \subseteq \mathcal{Y}_i$ of a variable $Y_i$ by a rule such as:

if $[Y_j = S_j]$ then $[Y_i = S_i]$


DISSIMILARITY AND SIMILARITY MEASURES

Dissimilarity measure: a function $d: E \times E \to \mathbb{R}$ such that

$$d^*_a = d(a, a) \le d(a, b) = d(b, a) < \infty \quad \forall a, b \in E$$

Similarity measure: a function $s: E \times E \to \mathbb{R}$ such that

$$s^*_a = s(a, a) \ge s(a, b) = s(b, a) \ge 0 \quad \forall a, b \in E$$

Generally, $\forall a \in E$: $d^*_a = d^*$ and $s^*_a = s^*$; specifically, $d^* = 0$ while $s^* = 1$.

Dissimilarity measures can be transformed into similarity measures (and vice versa):

$$d = \varphi(s) \qquad (s = \varphi^{-1}(d))$$

where $\varphi(s)$ is a strictly decreasing function with $\varphi(1) = 0$ and $\varphi(0) = \infty$.


DISSIMILARITY AND SIMILARITY MEASURES: PROPERTIES

Some properties that a dissimilarity measure d on E may satisfy are: 

1. $d(a, b) = 0 \Rightarrow \forall c \in E: d(a, c) = d(b, c)$ (evenness)

2. $d(a, b) = 0 \Rightarrow a = b$ (definiteness)

3. $d(a, b) \le d(a, c) + d(c, b)$ (triangle inequality)

4. $d(a, b) \le \max(d(a, c), d(c, b))$ (ultrametric inequality)

5. $d(a, b) + d(c, d) \le \max(d(a, c) + d(b, d), d(a, d) + d(b, c))$ (Buneman's inequality)

6. Let $(E, +)$ be a group, then $d(a, b) = d(a + c, b + c)$ (translation invariance)

A dissimilarity function that satisfies properties 2 and 3 is called a metric.

A dissimilarity function that satisfies only property 3 is called a pseudo-metric or semi-distance.
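Properties 2 and 3 can be checked empirically on a finite sample of objects. The Python sketch below uses hypothetical helper names and plain numbers instead of symbolic objects. Such a check can refute the metric property on the sample, but passing it does not prove the property on the whole space $E$.

```python
from itertools import product

def is_definite(d, objects):
    """Property 2 on the sample: d(a, b) = 0 only when a = b."""
    return all(a == b for a, b in product(objects, repeat=2) if d(a, b) == 0)

def satisfies_triangle(d, objects, tol=1e-12):
    """Property 3 on the sample: d(a, b) <= d(a, c) + d(c, b)."""
    return all(d(a, b) <= d(a, c) + d(c, b) + tol
               for a, b, c in product(objects, repeat=3))

def is_metric(d, objects):
    return is_definite(d, objects) and satisfies_triangle(d, objects)

points = [0.0, 1.0, 2.5, 4.0]
abs_diff = lambda a, b: abs(a - b)   # a metric on the real line
squared = lambda a, b: (a - b) ** 2  # violates the triangle inequality

print(is_metric(abs_diff, points), is_metric(squared, points))
```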


DISSIMILARITY MEASURES BETWEEN BSO’S

| Author(s) (Year) | Notation from the SODAS package |
|---|---|
| Gowda & Diday (1991) | U_1 |
| Ichino & Yaguchi (1994) | U_2, U_3, U_4 |
| De Carvalho (1994) | SO_1, SO_2 |
| De Carvalho (1996, 1998) | SO_3, SO_4, SO_5, C_1 |

• U: only for unconstrained BSO’s

• C: only for constrained BSO’s

• SO: for both constrained and unconstrained BSO’s


GOWDA & DIDAY’S DISSIMILARITY MEASURE

Gowda & Diday’s dissimilarity measure (U_1) for two BSO’s $a$ and $b$:

$$D(a, b) = \sum_{j=1}^{p} D(A_j, B_j)$$

If $Y_j$ is a continuous variable:

$$D(A_j, B_j) = D_\pi(A_j, B_j) + D_s(A_j, B_j) + D_c(A_j, B_j)$$

while if $Y_j$ is a nominal variable:

$$D(A_j, B_j) = D_s(A_j, B_j) + D_c(A_j, B_j)$$

where the components are defined so that their values are normalized between 0 and 1:

• $D_\pi(A_j, B_j)$ due to position,

• $D_s(A_j, B_j)$ due to span,

• $D_c(A_j, B_j)$ due to content.
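The slide does not spell out the three components. The Python sketch below fills them in with the definitions usually attributed to Gowda & Diday (1991); these formulas are an assumption recalled from the literature, not something stated in this presentation, and all function names are hypothetical. Intervals are `(lower, upper)` tuples, nominal values are non-empty sets.

```python
def gd_interval(a, b, domain_length):
    """Position + span + content components for two intervals (assumed formulas)."""
    (al, au), (bl, bu) = a, b
    la, lb = au - al, bu - bl                      # lengths of A_j and B_j
    span = max(au, bu) - min(al, bl)               # length of the covering interval
    inters = max(0.0, min(au, bu) - max(al, bl))   # length of the intersection
    d_pos = abs(al - bl) / domain_length
    if span == 0:  # both intervals are the same single point
        return d_pos
    d_span = abs(la - lb) / span
    d_content = (la + lb - 2 * inters) / span
    return d_pos + d_span + d_content

def gd_nominal(a, b):
    """Span + content components for two non-empty sets (assumed formulas)."""
    union = len(a | b)
    d_span = abs(len(a) - len(b)) / union
    d_content = (len(a) + len(b) - 2 * len(a & b)) / union
    return d_span + d_content

def u1(a, b, domain_lengths):
    """Sum of the componentwise dissimilarities over all variables."""
    total = 0.0
    for var in a:
        if isinstance(a[var], (set, frozenset)):
            total += gd_nominal(a[var], b[var])
        else:
            total += gd_interval(a[var], b[var], domain_lengths[var])
    return total

a = {"colour": {"red", "black"}, "height": (1.50, 1.60)}
b = {"colour": {"red"}, "height": (1.55, 1.70)}
print(u1(a, b, domain_lengths={"height": 1.0}))
```

The height domain length of 1.0 is an invented value for the example.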


GOWDA & DIDAY’S DISSIMILARITY MEASURE

Properties:

• $D(a, b) = 0 \Rightarrow a = b$ (definiteness property).

• No proof is reported for the triangle inequality property.


ICHINO & YAGUCHI’S DISSIMILARITY MEASURES

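Ichino & Yaguchi’s dissimilarity measures are based on the Cartesian operators join $\oplus$ and meet $\otimes$. For continuous variables:

• $A_j \oplus B_j$ is the smallest interval containing both $A_j$ and $B_j$

• $A_j \otimes B_j = A_j \cap B_j$

while for nominal variables:

• $A_j \oplus B_j = A_j \cup B_j$

• $A_j \otimes B_j = A_j \cap B_j$

Given a pair of subsets $(A_j, B_j)$ of $\mathcal{Y}_j$, the componentwise dissimilarity $\phi(A_j, B_j)$ is:

$$\phi(A_j, B_j) = |A_j \oplus B_j| - |A_j \otimes B_j| + \gamma \, (2 |A_j \otimes B_j| - |A_j| - |B_j|)$$

where $0 \le \gamma \le 0.5$ and $|A_j|$ is defined depending on the variable type.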
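The componentwise dissimilarity above can be sketched in Python as follows. The representation is an assumption made for illustration: nominal values are sets, intervals are `(lower, upper)` tuples, $|V|$ is taken as the cardinality of a set or the length of an interval, and an empty interval intersection is marked with `None`.

```python
def size(v):
    """|V|: cardinality of a set, length of an interval, 0 for an empty meet."""
    if v is None:
        return 0.0
    if isinstance(v, (set, frozenset)):
        return float(len(v))
    lo, hi = v
    return hi - lo

def join(a, b):
    """Cartesian join: set union, or the smallest interval covering both."""
    if isinstance(a, (set, frozenset)):
        return a | b
    return (min(a[0], b[0]), max(a[1], b[1]))

def meet(a, b):
    """Cartesian meet: intersection of sets or of intervals."""
    if isinstance(a, (set, frozenset)):
        return a & b
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None  # None marks an empty intersection

def phi(a, b, gamma=0.5):
    """Componentwise dissimilarity with 0 <= gamma <= 0.5."""
    m = size(meet(a, b))
    return size(join(a, b)) - m + gamma * (2 * m - size(a) - size(b))

print(phi({"red", "black"}, {"red"}))
print(phi((1.50, 1.60), (1.55, 1.70)))
```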


ICHINO & YAGUCHI’S DISSIMILARITY MEASURES

The componentwise dissimilarities $\phi(A_j, B_j)$ are aggregated by an aggregation function such as the generalised Minkowski distance of order $q$:

U_2:

$$d_q(a, b) = \left( \sum_{j=1}^{p} \phi(A_j, B_j)^q \right)^{1/q}$$

Drawback: dependence on the chosen units of measurement.

Solution: normalization of the componentwise dissimilarity:

U_3:

$$\psi(A_j, B_j) = \frac{\phi(A_j, B_j)}{|\mathcal{Y}_j|}, \qquad d_q(a, b) = \left( \sum_{j=1}^{p} \psi(A_j, B_j)^q \right)^{1/q}$$

The weighted formulation guarantees that $d_q(a, b) \in [0, 1]$:

U_4:

$$d_q(a, b) = \left( \sum_{j=1}^{p} c_j \, \psi(A_j, B_j)^q \right)^{1/q}$$

The above measures are metrics.
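A small Python sketch of the three aggregation variants follows. It takes already computed componentwise values as input; the function names and the numbers are illustrative assumptions, and the weights are applied outside the power, which is one reading of the weighted formula.

```python
def minkowski(components, q=2, weights=None):
    """Generalised Minkowski aggregation of order q (weights default to 1)."""
    if weights is None:
        weights = [1.0] * len(components)
    return sum(w * c ** q for w, c in zip(weights, components)) ** (1.0 / q)

def normalise(phis, domain_sizes):
    """Divide each componentwise value by the size of its variable's domain."""
    return [f / size for f, size in zip(phis, domain_sizes)]

phis = [0.5, 0.075]        # illustrative componentwise values, one per variable
domain_sizes = [3, 1.0]    # assumed |Y_j|: three colours, height domain of length 1

u2 = minkowski(phis)
u3 = minkowski(normalise(phis, domain_sizes))
u4 = minkowski(normalise(phis, domain_sizes), weights=[0.5, 0.5])
print(u2, u3, u4)
```

With weights that sum to 1 and normalised components in [0, 1], the weighted value stays in [0, 1], which is the guarantee mentioned for U_4.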


DE CARVALHO’S DISSIMILARITY MEASURES

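A straightforward extension of similarity measures for classical data matrices with nominal variables.

| | Agreement | Disagreement | Total |
|---|---|---|---|
| Agreement | $\alpha = \mu(A_j \cap B_j)$ | $\beta = \mu(A_j \cap c(B_j))$ | $\mu(A_j)$ |
| Disagreement | $\gamma = \mu(c(A_j) \cap B_j)$ | $\delta = \mu(c(A_j) \cap c(B_j))$ | $\mu(c(A_j))$ |
| Total | $\mu(B_j)$ | $\mu(c(B_j))$ | $\mu(\mathcal{Y}_j)$ |

where $\mu(V_j)$ is either the cardinality of the set $V_j$ (if $Y_j$ is a nominal variable) or the length of the interval $V_j$ (if $Y_j$ is a continuous variable), and $c(V_j)$ is the complement of $V_j$ in $\mathcal{Y}_j$.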


DE CARVALHO’S DISSIMILARITY MEASURES

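Five different similarity measures $s_i$, $i = 1, \dots, 5$, are defined:

| $s_i$ | Comparison function | Range | Property |
|---|---|---|---|
| $s_1$ | $\alpha / (\alpha + \beta + \gamma)$ | [0, 1] | metric |
| $s_2$ | $2\alpha / (2\alpha + \beta + \gamma)$ | [0, 1] | semi-metric |
| $s_3$ | $\alpha / (\alpha + 2\beta + 2\gamma)$ | [0, 1] | metric |
| $s_4$ | $\frac{1}{2} [\alpha / (\alpha + \beta) + \alpha / (\alpha + \gamma)]$ | [0, 1] | semi-metric |
| $s_5$ | $\alpha / [(\alpha + \beta)(\alpha + \gamma)]^{1/2}$ | [0, 1] | semi-metric |

The corresponding dissimilarities are $d_i = 1 - s_i$. The $d_i$ are aggregated by an aggregation function AF such as the generalised Minkowski metric, thus obtaining SO_1:

$$d_q^i(a, b) = \left( \sum_{j=1}^{p} [w_j \, d_i(A_j, B_j)]^q \right)^{1/q}, \qquad 1 \le i \le 5$$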
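For nominal variables the agreement counts reduce to set operations, so the five comparison functions can be sketched directly in Python. The function names are hypothetical, and the sketch assumes non-empty sets so that no denominator is zero.

```python
from math import sqrt

def agreement_counts(a, b):
    """alpha, beta, gamma for two finite sets (nominal variable)."""
    alpha = len(a & b)   # values in both A_j and B_j
    beta = len(a - b)    # values in A_j only
    gamma = len(b - a)   # values in B_j only
    return alpha, beta, gamma

def similarities(alpha, beta, gamma):
    """The five comparison functions s_1 ... s_5 from the table."""
    return {
        "s1": alpha / (alpha + beta + gamma),
        "s2": 2 * alpha / (2 * alpha + beta + gamma),
        "s3": alpha / (alpha + 2 * beta + 2 * gamma),
        "s4": 0.5 * (alpha / (alpha + beta) + alpha / (alpha + gamma)),
        "s5": alpha / sqrt((alpha + beta) * (alpha + gamma)),
    }

def dissimilarities(a, b):
    """d_i = 1 - s_i for each of the five comparison functions."""
    sims = similarities(*agreement_counts(a, b))
    return {name.replace("s", "d"): 1 - s for name, s in sims.items()}

d = dissimilarities({"red", "black"}, {"red", "white", "green"})
print(d)
```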


DE CARVALHO’S EXTENSION OF ICHINO & YAGUCHI’S DISSIMILARITY MEASURE

A different componentwise dissimilarity measure:

$$\psi'(A_j, B_j) = \frac{\phi(A_j, B_j)}{\mu(A_j \oplus B_j)}$$

where $\phi$ is defined as in Ichino & Yaguchi’s dissimilarity measure. The aggregation function AF suggested by De Carvalho is:

SO_2:

$$d_q(a, b) = \left( \frac{1}{p} \sum_{j=1}^{p} [\psi'(A_j, B_j)]^q \right)^{1/q}$$

This measure is a metric.


THE DESCRIPTION-POTENTIAL APPROACH

All dissimilarity measures considered so far are defined by two functions: a comparison function (componentwise measure) and an aggregation function.

A different approach is based on the concept of description potential $\pi(a)$ of a symbolic object $a$:

$$\pi(a) = \prod_{j=1}^{p} \mu(A_j)$$

where $\mu(V_j)$ is either the cardinality of the set $V_j$ (if $Y_j$ is a nominal variable) or the length of the interval $V_j$ (if $Y_j$ is a continuous variable).


THE DESCRIPTION-POTENTIAL APPROACH

SO_3:

$$d'_1(a, b) = \pi(a \oplus b) - \pi(a \otimes b) + \gamma \, [2\pi(a \otimes b) - \pi(a) - \pi(b)]$$

SO_4:

$$d'_2(a, b) = \frac{\pi(a \oplus b) - \pi(a \otimes b) + \gamma \, [2\pi(a \otimes b) - \pi(a) - \pi(b)]}{\pi(a^E)}$$

SO_5:

$$d'_3(a, b) = \frac{\pi(a \oplus b) - \pi(a \otimes b) + \gamma \, [2\pi(a \otimes b) - \pi(a) - \pi(b)]}{\pi(a \oplus b)}$$

The triangular inequality does not hold for SO_3 and SO_4, which are equivalent. SO_5 is a metric.
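The three potential-based measures can be sketched in Python as follows. Several points are assumptions made for illustration: symbolic objects are dictionaries of sets and `(lower, upper)` intervals, $a \oplus b$ and $a \otimes b$ are taken variable by variable, and $a^E$ is read as the object whose description covers the whole domain of every variable. The example values and domains are invented.

```python
from math import prod

def mu(v):
    """Cardinality of a set, length of an interval, 0 for an empty meet (None)."""
    if v is None:
        return 0.0
    if isinstance(v, (set, frozenset)):
        return float(len(v))
    return v[1] - v[0]

def join_value(a, b):
    if isinstance(a, (set, frozenset)):
        return a | b
    return (min(a[0], b[0]), max(a[1], b[1]))

def meet_value(a, b):
    if isinstance(a, (set, frozenset)):
        return a & b
    lo, hi = max(a[0], b[0]), min(a[1], b[1])
    return (lo, hi) if lo <= hi else None

def potential(so):
    """Description potential: product of mu(A_j) over all variables."""
    return prod(mu(v) for v in so.values())

def potential_distances(a, b, domains, gamma=0.5):
    """SO_3, SO_4 and SO_5 for two unconstrained objects."""
    joined = {k: join_value(a[k], b[k]) for k in a}
    met = {k: meet_value(a[k], b[k]) for k in a}
    d1 = (potential(joined) - potential(met)
          + gamma * (2 * potential(met) - potential(a) - potential(b)))
    return {"SO_3": d1,
            "SO_4": d1 / potential(domains),   # normalised by pi(a^E)
            "SO_5": d1 / potential(joined)}    # normalised by pi(a join b)

a = {"colour": {"red", "black"}, "height": (1.50, 1.60)}
b = {"colour": {"red"}, "height": (1.55, 1.70)}
domains = {"colour": {"red", "black", "white"}, "height": (1.0, 2.0)}
print(potential_distances(a, b, domains))
```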


DESCRIPTION POTENTIAL FOR CONSTRAINED BSO’S

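Given a BSO $a$ and a logical dependence expressed by the rule:

if $[Y_j = S_j]$ then $[Y_i = S_i]$

the incoherent restriction $a'$ of $a$ is defined as:

$$a' = [Y_1 = A_1] \wedge \dots \wedge [Y_{j-1} = A_{j-1}] \wedge [Y_j = A_j \cap S_j] \wedge \dots \wedge [Y_{i-1} = A_{i-1}] \wedge [Y_i = A_i \cap (\mathcal{Y}_i \setminus S_i)] \wedge \dots \wedge [Y_p = A_p]$$

Then the description potential of $a$ is:

$$\pi(a) = \prod_{j=1}^{p} \mu(A_j) - \pi(a')$$

A similar extension exists for hierarchical dependencies.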


DISSIMILARITY MEASURES FOR CONSTRAINED BSO’S

• The extended definition of description potential can be applied to the computation of the distances SO_3, SO_4 and SO_5.

• De Carvalho proposed an extension of $\psi'$, so that SO_2 can also be applied to constrained BSO’s.

• He also proposed an extension of $\alpha$, $\beta$, $\gamma$ and $\delta$ in order to take constraints into account. Therefore, SO_1 can also be applied to constrained BSO’s.

Finally, C_1 is defined as follows:

$$d(a, b) = \left( \frac{\sum_{j=1}^{p} [d_i(A_j, B_j)]^q}{\sum_{j=1}^{p} \delta(j)} \right)^{1/q}$$

where $\delta(j)$ is the indicator function:

$$\delta(j) = \begin{cases} 0 & \text{if } Y_j = NA \\ 1 & \text{otherwise} \end{cases}$$

If all BSO’s are coherent, then the dissimilarity measures do not change.


MATCHING

• Matching is the process of comparing two or more structures to discover their similarities or differences.

• Similarity judgements in the matching process are directional. They have a

– referent, a: a prototype or the description of a class of objects

– subject, b: a variant of the prototype or an instance of a class of objects.

• Matching two structures is a problem common to many domains, such as symbolic classification, pattern recognition, data mining and expert systems.


MATCHING BSO’S

• Generally, a BSO represents a class description and plays the role of the referent in the matching process. For example,

a: [color = {black, white}] ∧ [height = [170, 200]]

describes a set of individuals, either black or white, whose height is in the interval [170, 200]. Such a set of individuals is called the extension of the BSO. The extension is a subset of the universe of individuals.

• Given two BSO’s a and b, the matching operators define whether b is the description of an individual in the extension of a.

• In the ASSO software two matching operators for BSO’s have been defined.


CANONICAL MATCHING OPERATOR

• The result of the canonical matching operator is either 0 (false) or 1 (true).

• If $E$ denotes the space of BSO’s described by a set of $p$ variables $Y_i$ taking values in the corresponding domains $\mathcal{Y}_i$, then the matching operator is a function

Match: $E \times E \to \{0, 1\}$

such that for any two BSO’s $a, b \in E$:

$$a = [Y_1 = A_1] \wedge [Y_2 = A_2] \wedge \dots \wedge [Y_p = A_p]$$

$$b = [Y_1 = B_1] \wedge [Y_2 = B_2] \wedge \dots \wedge [Y_p = B_p]$$

it happens that:

• Match(a, b) = 1 if $B_i \subseteq A_i$ for each $i = 1, 2, \dots, p$,

• Match(a, b) = 0 otherwise.


CANONICAL MATCHING OPERATOR

Examples:

District1 = [profession = {farmer, driver}] ∧ [age = [24, 34]]

Indiv1 = [profession = farmer] ∧ [age = 28]

Indiv2 = [profession = salesman] ∧ [age = [27, 28]]

Match(District1, Indiv1) = 1

Match(District1, Indiv2) = 0
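The two results can be reproduced with a short Python sketch of the canonical operator. The dictionary representation and the function names are assumptions made for illustration: nominal values are sets, and a single value such as age 28 is written as the degenerate interval (28, 28).

```python
def covers(referent_value, subject_value):
    """True if the subject's values are contained in the referent's (B_i within A_i)."""
    if isinstance(referent_value, (set, frozenset)):
        return subject_value <= referent_value
    (a_lo, a_hi), (b_lo, b_hi) = referent_value, subject_value
    return a_lo <= b_lo and b_hi <= a_hi

def match(a, b):
    """Canonical matching: 1 if every variable of b is covered by a, else 0."""
    return int(all(covers(a[var], b[var]) for var in a))

district1 = {"profession": {"farmer", "driver"}, "age": (24, 34)}
indiv1 = {"profession": {"farmer"}, "age": (28, 28)}
indiv2 = {"profession": {"salesman"}, "age": (27, 28)}

print(match(district1, indiv1), match(district1, indiv2))
```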


CANONICAL MATCHING OPERATOR

• The canonical matching function satisfies two out of three properties of a similarity measure:

$\forall a, b \in E$: Match(a, b) $\ge 0$

$\forall a, b \in E$: Match(a, a) $\ge$ Match(a, b)

while it does not satisfy the commutativity or symmetry property:

$\forall a, b \in E$: Match(a, b) = Match(b, a)

because of the different roles played by a and b.


FLEXIBLE MATCHING OPERATOR

• The requirement $B_i \subseteq A_i$ for each $i = 1, 2, \dots, p$ might be too strict for real-world problems, because of the presence of noise in the description of the individuals of the universe.

• Example:

District1 = [profession = {farmer, driver}] ∧ [age = [24, 34]]

Indiv3 = [profession = farmer] ∧ [age = 23]

Match(District1, Indiv3) = 0

• It is necessary to rely on a flexible definition of the matching operator, which returns a number in [0, 1] corresponding to the degree of match between two BSO’s, that is

flexible-matching: $E \times E \to [0, 1]$


FLEXIBLE MATCHING OPERATOR

For any two BSO’s a and b:

i) flexible-matching(a, b) = 1 if Match(a, b) = 1 (true),

ii) flexible-matching(a, b) $\in [0, 1)$ otherwise.

The result of the flexible matching can be interpreted as the probability of a matching b provided that a change is made in b.

Let $E_a = \{b' \in E \mid \mathrm{Match}(a, b') = 1\}$ and let $P(b \mid b')$ be the conditional probability of observing b given that the original observation was b'. Then

$$\text{flexible-matching}(a, b) \stackrel{\text{def}}{=} \max_{b' \in E_a} P(b \mid b')$$

that is, flexible-matching(a, b) equals the maximum conditional probability over the space of BSO’s canonically matched by a.


FLEXIBLE MATCHING: AN APPLICATION

• Credit card applications (Quinlan)

• Fifteen variables whose names and values have been changed to meaningless symbols to protect the confidentiality of the data.

• Class variable: positive in case of approval of credit facilities, negative otherwise.

• Training set: 490 cases

• 6 rules generated by Quinlan’s system C4.5


FLEXIBLE MATCHING: AN APPLICATION

• Such rules can be easily represented by means of Boolean symbolic objects.

• Both matching operators can be considered in order to test the validity of the induced rules.

| Rule | Class | Conditions |
|---|---|---|
| 41 | - | [Y3 > 1.54] ∧ [Y9 = f] ∧ [Y4 ∈ {u, y}] ∧ [Y6 ∈ {c, d, cc, i, j, k, m, r, q, w, e, aa, ff}] |
| 43 | - | [Y4 ∈ {u, y}] ∧ [Y8 <= 1.71] ∧ [Y9 = f] |
| 6 | - | [Y3 <= 0.835] ∧ [Y6 ∈ {c, d, i, k, m, q, w, e, aa}] ∧ [Y7 ∈ {v, bb}] ∧ [Y14 > 102] ∧ [Y15 <= 500] |
| 30 | + | [Y9 = t] |
| 34 | + | [Y3 <= 0.125] ∧ [Y14 > 221] |
| 46 | + | [Y4 ∈ {l}] |


A new dissimilarity measure

Flexible matching is asymmetric. However, it is possible to “symmetrize” it, which gives a new dissimilarity measure, SO_6.

It is computed as:

$$d(a, b) = 1 - \frac{\text{flexible-matching}(a, b) + \text{flexible-matching}(b, a)}{2}$$
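The symmetrization itself is independent of how flexible matching is computed, so it can be sketched in Python with the matching function passed in as a parameter. The scoring function below is only a stand-in invented for the example (the share of variables on which one object covers the other); it is not the probabilistic definition used in ASSO, although it also returns 1 exactly when the canonical match succeeds.

```python
def toy_flexible_matching(a, b):
    """Stand-in score: share of variables on which b's values lie within a's."""
    hits = sum(1 for var in a if b[var] <= a[var])
    return hits / len(a)

def so6(a, b, flexible_matching):
    """Symmetrized dissimilarity: 1 minus the mean of the two directed scores."""
    return 1 - (flexible_matching(a, b) + flexible_matching(b, a)) / 2

a = {"profession": {"farmer", "driver"}, "status": {"single"}}
b = {"profession": {"farmer"}, "status": {"single"}}

print(toy_flexible_matching(a, b), toy_flexible_matching(b, a))
print(so6(a, b, toy_flexible_matching))
```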


| Dissimilarity measure | Parameters | Constraints | Default |
|---|---|---|---|
| U_1 (Gowda & Diday) | none | | |
| U_2 (Ichino & Yaguchi) | Gamma | [0 .. 0.5] | 0.5 |
| | Order of power | 1 .. 10 | 2 |
| U_3 (Normalized Ichino & Yaguchi) | Gamma | [0 .. 0.5] | 0.5 |
| | Order of power | 1 .. 10 | 2 |
| U_4 (Weighted Normalized Ichino & Yaguchi) | Gamma | [0 .. 0.5] | 0.5 |
| | Order of power | 1 .. 10 | 2 |
| | List of weights, one per variable | Sum(weights) = 1.0 | Equal weights |
| C_1 (Normalized De Carvalho) | Comparison function | D1, D2, D3, D4, D5 | D1 |
| | Order of power | 1 .. 10 | 2 |
| SO_1 (De Carvalho) | Comparison function | D1, D2, D3, D4, D5 | D1 |
| | Order of power | 1 .. 10 | 2 |
| | List of weights, one per variable | Sum(weights) = 1.0 | Equal weights |
| SO_2 (De Carvalho) | Gamma | [0 .. 0.5] | 0.5 |
| | Order of power | 1 .. 10 | 2 |
| SO_3 (De Carvalho) | Gamma | [0 .. 0.5] | 0.5 |
| | Order of power | 1 .. 10 | 2 |
| SO_4 (Normalized De Carvalho) | Gamma | [0 .. 0.5] | 0.5 |
| | Order of power | 1 .. 10 | 2 |
| SO_5 (Normalized De Carvalho) | Gamma | [0 .. 0.5] | 0.5 |
| | Order of power | 1 .. 10 | 2 |
| SO_6 (Symmetrized Flexible Matching) | none | | |


PROBABILISTIC SYMBOLIC OBJECTS (PSO’S)

Probabilistic symbolic objects (PSO’s) involve modal (probabilistic) variables.

Each cell represents the set of weighted values that the variable can take for a symbolic object, where a probabilistic weighting system is adopted.

The dissimilarity measures for BSO’s cannot be used for PSO’s, because they ignore the probabilities and therefore lose a notable amount of information.

Therefore, new dissimilarity measures for PSO’s are needed.


Defining dissimilarity measures for probabilistic symbolic objects

Steps:

1. Define coefficients measuring the divergence between two probability distributions:

• Non-symmetric coefficients: Kullback-Leibler divergence, Chi-square divergence, K-divergence

• Similarity coefficient (*): Hellinger coefficient

• Symmetric coefficient: Variation distance

(*) From it two dissimilarity measures, namely Rényi’s and Chernoff’s coefficients, are obtained.


Defining dissimilarity measures for probabilistic symbolic objects

Steps:

2. Symmetrize the non-symmetric coefficients: $\tilde{m}(P, Q) = m(Q, P) + m(P, Q)$

3. Aggregate the contribution of all variables to compute the dissimilarity between two symbolic objects

PSO Dissimilarity measures
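The three steps can be sketched in Python for two of the coefficients named above. This is an illustration under stated assumptions: distributions are lists of probabilities over the same ordered values, the Kullback-Leibler sum assumes the second distribution is positive wherever the first is, and the per-variable results are aggregated by a plain sum, which is only one possible choice since the aggregation function is not specified here. All names are hypothetical.

```python
from math import log

def kl_divergence(p, q):
    """Kullback-Leibler divergence of p from q (non-symmetric)."""
    return sum(pi * log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def variation_distance(p, q):
    """Variation distance: sum of absolute differences (already symmetric)."""
    return sum(abs(pi - qi) for pi, qi in zip(p, q))

def symmetrize(m):
    """Step 2: turn a non-symmetric coefficient m into m(Q, P) + m(P, Q)."""
    return lambda p, q: m(q, p) + m(p, q)

def aggregate(a, b, coefficient):
    """Step 3 (assumed additive): sum the coefficient over all modal variables."""
    return sum(coefficient(a[var], b[var]) for var in a)

sym_kl = symmetrize(kl_divergence)

# Two toy PSO's with one modal variable each (invented probabilities).
a = {"sex": [0.5, 0.5]}
b = {"sex": [0.9, 0.1]}

print(kl_divergence(a["sex"], b["sex"]), kl_divergence(b["sex"], a["sex"]))
print(aggregate(a, b, sym_kl), aggregate(a, b, variation_distance))
```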


Mixture SO

Some SO’s can be described by both non-modal and modal variables

They are neither BSO’s nor PSO’s.

What dissimilarity measure, then?

In ASSO it has been proposed to combine the results of two dissimilarity measures, one for modal and the other for non-modal variables.

Combination can be either additive or multiplicative.

This possibility should be taken with great care!!!


REFERENCES

• F. Esposito, D. Malerba, V. Tamma, H.-H. Bock. Classical resemblance measures. Chapter 8.1.

• F. Esposito, D. Malerba, V. Tamma. Dissimilarity measures for symbolic objects. Chapter 8.3.

• F. Esposito, D. Malerba, F.A. Lisi. Matching symbolic objects. Chapter 8.4.

The three chapters above are in H.-H. Bock, E. Diday (Eds.), Analysis of Symbolic Data. Exploratory Methods for Extracting Statistical Information from Complex Data. Springer Verlag, Heidelberg, 2000.

• D. Malerba, L. Sanarico, V. Tamma (2000). A comparison of dissimilarity measures for Boolean symbolic data. In P. Brito, J. Costa, D. Malerba (Eds.), Proc. of the ECML 2000 Workshop on “Dealing with Structured Data in Machine Learning and Statistics”, Barcelona.

• D. Malerba, F. Esposito, V. Gioviale, V. Tamma. Comparing Dissimilarity Measures in Symbolic Data Analysis. Pre-Proceedings of EKT-NTTS, vol. 1, pp. 473-481.

METHOD DISS

• Dissimilarity measures between both BSO’s and PSO’s.

• Input: ASSO file of SO’s

• Output for dissimilarities: report + ASSO file with dissimilarity matrix

• Developer: Dipartimento di Informatica, University of Bari, Italy.

DI method

Report file

TWO USE CASE DIAGRAMS

Use case diagram 1 (actor: User):

• Create an ASSO chaining with the DISS method
• Set up parameters of the DISS method
• Run the DISS method and generate a new ASSO file with a dissimilarity matrix
• Create a new chaining with the new ASSO file

Use case diagram 2 (actor: User):

• Create an ASSO chaining with the DISS method
• Set up parameters of the DISS method
• Run the DISS method and generate a report file
• View report file
• Run VDISS and visualize the dissimilarity measure, the bi-dimensional mapping & the graphical representation

PARAMETER SETUP

• The user can select a subset of variables Yi on which the dissimilarity measure or the matching operator has to be computed.
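Restricting the computation to a user-selected subset of variables can be sketched as below. The per-variable measure used here (the size of the symmetric difference between two value sets) is only a placeholder, and all names are hypothetical; the DISS method offers its own measures.

```python
def symmetric_difference_size(values_a, values_b):
    """Placeholder per-variable dissimilarity for set-valued variables."""
    return len(set(values_a) ^ set(values_b))

def dissimilarity_on_subset(a, b, selected, per_variable=symmetric_difference_size):
    """Sum the per-variable dissimilarities over the selected variables only."""
    missing = [var for var in selected if var not in a or var not in b]
    if missing:
        raise KeyError(f"variables not described by both objects: {missing}")
    return sum(per_variable(a[var], b[var]) for var in selected)

# Two Boolean symbolic objects described by set-valued variables.
so_a = {"colour": {"red", "black"}, "weight": {60, 70, 80}}
so_b = {"colour": {"red"}, "weight": {60, 70}}
print(dissimilarity_on_subset(so_a, so_b, selected=["colour"]))
print(dissimilarity_on_subset(so_a, so_b, selected=["colour", "weight"]))
```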

PARAMETER SETUP

• The user can select a number of parameters.

[Screenshot of the parameter dialog: dissimilarity measure, name of the new ASSO file, combine option]

OUTPUT SODAS FILE

• The output ASSO file contains the input data plus an additional dissimilarity matrix. The dissimilarity between the i-th and the j-th BSO is written in entry (i, j) of the matrix.

• Only the lower triangular part of the dissimilarity matrix is stored in the file, since dissimilarities are symmetric.

abalone output file
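Storing only the lower part of a symmetric matrix can be sketched as follows. The layout (row i holds the entries (i, 0) to (i, i), diagonal included) and the function names are assumptions for illustration; the actual ASSO file layout may differ.

```python
def lower_triangle(objects, dissimilarity):
    """Compute d(i, j) only for j <= i; row i has i + 1 entries."""
    return [
        [dissimilarity(objects[i], objects[j]) for j in range(i + 1)]
        for i in range(len(objects))
    ]

def lookup(rows, i, j):
    """Read entry (i, j) of the full matrix from its lower triangle."""
    return rows[i][j] if j <= i else rows[j][i]

# Toy objects: plain numbers compared by absolute difference.
rows = lower_triangle([1, 4, 6], lambda a, b: abs(a - b))
print(rows)
print(lookup(rows, 0, 2), lookup(rows, 2, 0))
```

For n objects this stores n(n+1)/2 values instead of n², and symmetry guarantees that nothing is lost.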

OUTPUT REPORT FILE

The report file is organized as follows:

Output report file

Output

Visualization of the dissimilarity table

Output

Visualization of a line graph of dissimilarities

Each line represents the dissimilarity between a given SO and the subsequent SOs in the file

The number of lines in each graph is equal to the number of SOs minus one
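How the lines of such a graph could be derived from the dissimilarity matrix is sketched below. This is one reading of the description above, not the VDISS code: the function name is hypothetical and the input is assumed to be a full symmetric matrix given as a list of lists.

```python
def line_graph_series(matrix):
    """Return one series per SO except the last one.

    Series i holds the dissimilarities between the i-th SO and the SOs
    that follow it in the file.
    """
    n = len(matrix)
    return [[matrix[i][j] for j in range(i + 1, n)] for i in range(n - 1)]

matrix = [
    [0, 1, 2, 3],
    [1, 0, 4, 5],
    [2, 4, 0, 6],
    [3, 5, 6, 0],
]
for index, series in enumerate(line_graph_series(matrix)):
    print(f"SO {index + 1}: {series}")
```

With four SOs the sketch yields three series, in line with the count of lines stated above.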

Output

Visualization of a scatterplot of Sammon’s nonlinear mapping into a bidimensional space
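Sammon's nonlinear mapping places one point per SO in the plane so that the Euclidean distances between points approximate the given dissimilarities, by minimizing Sammon's stress. The sketch below is a simplified illustration and not the VDISS implementation: it uses plain gradient descent with step halving instead of Sammon's original pseudo-Newton update, it assumes a full symmetric matrix with strictly positive off-diagonal entries, and all function and parameter names are hypothetical.

```python
import math
import random

def sammon_stress(target, points):
    """Sammon's stress: weighted squared error between target and mapped distances."""
    n = len(points)
    total = sum(target[i][j] for i in range(n) for j in range(i))
    error = 0.0
    for i in range(n):
        for j in range(i):
            d = math.dist(points[i], points[j])
            error += (target[i][j] - d) ** 2 / target[i][j]
    return error / total

def sammon_map(target, iterations=200, step=0.5, seed=0):
    """Map n objects to 2-D points; returns (points, final stress)."""
    n = len(target)
    rng = random.Random(seed)
    points = [[rng.uniform(-1, 1), rng.uniform(-1, 1)] for _ in range(n)]
    total = sum(target[i][j] for i in range(n) for j in range(i))
    stress = sammon_stress(target, points)
    for _ in range(iterations):
        gradients = []
        for i in range(n):
            gx = gy = 0.0
            for j in range(n):
                if i == j:
                    continue
                d = max(math.dist(points[i], points[j]), 1e-12)
                coef = (target[i][j] - d) / (target[i][j] * d)
                gx += coef * (points[i][0] - points[j][0])
                gy += coef * (points[i][1] - points[j][1])
            gradients.append((-2.0 * gx / total, -2.0 * gy / total))
        candidate = [
            [p[0] - step * g[0], p[1] - step * g[1]]
            for p, g in zip(points, gradients)
        ]
        new_stress = sammon_stress(target, candidate)
        if new_stress <= stress:
            points, stress = candidate, new_stress  # accept the move
        else:
            step *= 0.5  # reject the move and try a smaller step next time
    return points, stress

# Dissimilarities equal to the distances between the corners of a unit square.
r = math.sqrt(2)
square = [[0, 1, r, 1], [1, 0, 1, r], [r, 1, 0, 1], [1, r, 1, 0]]
points, stress = sammon_map(square)
print(f"final stress: {stress:.6f}")
```

Because a move is accepted only when it does not increase the stress, the final stress never exceeds that of the random initial configuration; the resulting points can then be drawn as a scatterplot.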