September 5-7, Trento Deriving “sub-source” similarities from heterogeneous, semi-structured information sources D. Rosaci, G. Terracina, D. Ursino Dipartimento di Informatica, Matematica, Elettronica e Trasporti Università “Mediterranea” di Reggio Calabria International Conference on Cooperative Information Systems (CoopIS 2001)

September 5-7, Trento

Deriving “sub-source” similarities from heterogeneous, semi-structured information sources

D. Rosaci, G. Terracina, D. Ursino

Dipartimento di Informatica, Matematica, Elettronica e Trasporti

Università “Mediterranea” di Reggio Calabria

International Conference on Cooperative Information Systems (CoopIS 2001)

Motivations

Scheme Match: finding a mapping between those elements of two schemes that semantically correspond to each other

Applications: information source integration, e-commerce, scheme evaluation and migration, data and web warehousing, information source design, and so on

The need for semi-automatic techniques to carry out this task is now widely recognized

Most of the Scheme Match techniques proposed in the literature have been designed only for databases

They aim at deriving terminological and structural relationships between single concepts

Motivations

New approaches to Scheme Match that handle semi-structured information sources are needed

Such approaches must differ from the traditional ones because:

• in semi-structured information sources, significant pieces of information are expressed as groups of concepts rather than single ones

• different instances of the same concept can have different structures

The emphasis shifts from extracting semantic correspondences between concepts to deriving semantic correspondences between groups of concepts

General characteristics of the approach

We propose a semi-automatic technique for extracting similarities between sub-sources belonging to different, heterogeneous, semi-structured information sources

Adopting a conceptual model capable of uniformly handling information sources of different formats is extremely useful

Translation rules should be defined from classical information source formats to the adopted conceptual model

Our approach exploits the SDR-Network conceptual model, which meets the requirements described above

General characteristics of the approach

Given an information source IS, the number of possible sub-sources that can be derived from it is extremely high

To avoid handling huge numbers of sub-source pairs, we propose a heuristic technique for singling out only the most promising ones

Once the most promising pairs of sub-sources have been selected, their similarity degree must be computed

The similarity degree associated with each pair of sub-sources is determined by computing the objective function associated with a maximum weight matching

SSi can be detected to be similar to SSj only if it is possible to single out concepts of SSi and SSj that are, in turn, pairwise similar

General characteristics of the approach

The SDR-Network and its metrics have already been exploited for defining a technique for deriving synonymies and homonymies

On the whole, we propose a unified, semi-automatic approach for deriving concept synonymies and homonymies, as well as sub-source similarities

This is particularly interesting because:

• We propose the derivation of a property which is generally not handled by most of the Scheme Match approaches proposed in the literature

• The technique proposed here is part of a more general framework for deriving various kinds of terminological and structural properties

The SDR-Network conceptual model

Given an information source IS, the associated SDR-Network Net(IS) is

Net(IS) = < NS(IS), AS(IS) >

NS(IS) represents the set of nodes; each node is characterized by a name

AS(IS) denotes the set of arcs; each arc can be represented by a triplet < S, T, LST >

S is the source node

T is the target node

LST = [dST, rST] is a label associated with the arc

The SDR-Network conceptual model

• dST is the semantic distance coefficient:

– it indicates how much the concept expressed by T is semantically close to the concept expressed by S

– this depends on the capability of the concept associated with T to characterize the concept associated with S

• rST is the semantic relevance coefficient: it indicates the fraction of instances of the concept denoted by S whose complete definition requires at least one instance of the concept represented by T
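A minimal sketch of how an SDR-Network could be held in memory, assuming Python. The class names `Arc` and `SDRNetwork` and the helper `outgoing` are hypothetical and do not come from the talk; only the structure (named nodes, arcs labelled with a distance and a relevance coefficient) follows the definitions above.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Arc:
    """One arc <S, T, [d_ST, r_ST]> of an SDR-Network."""
    source: str
    target: str
    distance: float   # d_ST: semantic distance coefficient
    relevance: float  # r_ST: semantic relevance coefficient (a fraction of instances)

    def __post_init__(self):
        # r_ST is defined as a fraction, so it must lie in [0, 1].
        if not 0.0 <= self.relevance <= 1.0:
            raise ValueError("semantic relevance must lie in [0, 1]")


@dataclass
class SDRNetwork:
    """Net(IS) = <NS(IS), AS(IS)>: a set of named nodes and a set of labelled arcs."""
    nodes: set = field(default_factory=set)
    arcs: set = field(default_factory=set)

    def add_arc(self, source, target, distance, relevance):
        """Create an arc, register both endpoints as nodes, and return the arc."""
        arc = Arc(source, target, distance, relevance)
        self.nodes.update((source, target))
        self.arcs.add(arc)
        return arc

    def outgoing(self, node):
        """Arcs whose source node is `node`."""
        return [a for a in self.arcs if a.source == node]
```

For instance, `SDRNetwork().add_arc("Project[ESF]", "Country[ESF]", 0, 1.0)` would record an arc with label [0, 1], the form used in the example later in the talk.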

The SDR-Network conceptual model

• The Path Semantic Distance PSDP of a path P in Net(IS) is the sum of the semantic distance coefficients associated with the arcs included in the path

• The Path Semantic Relevance PSRP of a path P in Net(IS) is the product of the semantic relevance coefficients associated with the arcs included in the path

• The CD-Shortest-Path (Conditional D-Shortest-Path) between two nodes N and N’ in Net(IS) and including an arc A (denoted by <N, N’>_A) is the path having the minimum Path Semantic Distance among those connecting N and N’ and including A

• A D-Path_n is a path P in Net(IS) such that n ≤ PSDP < n+1

• The i-th neighborhood of an SDR-Network node x is:

nbh(x,i) = {A | A ∈ AS(IS), A = <z,y,lzy>, <x,y>_A is a D-Path_i, x ≠ y}, i ≥ 0
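The metrics above can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: arcs are plain `(source, target, d, r)` tuples, distance coefficients are assumed non-negative, and the CD-Shortest-Path from x to y including the arc <z, y> is taken to be the shortest path from x to z followed by that arc (which holds under the non-negativity assumption). All function names are hypothetical.

```python
import heapq
import math
from collections import defaultdict


def path_semantic_distance(path):
    """PSD: sum of the distance coefficients of the arcs in the path."""
    return sum(d for _, _, d, _ in path)


def path_semantic_relevance(path):
    """PSR: product of the relevance coefficients of the arcs in the path."""
    return math.prod(r for _, _, _, r in path)


def shortest_distances(arcs, start):
    """Dijkstra over distance coefficients, from `start` to every reachable node."""
    adjacency = defaultdict(list)
    for s, t, d, _ in arcs:
        adjacency[s].append((t, d))
    best = {start: 0.0}
    queue = [(0.0, start)]
    while queue:
        dist, node = heapq.heappop(queue)
        if dist > best.get(node, math.inf):
            continue  # stale queue entry
        for nxt, d in adjacency[node]:
            cand = dist + d
            if cand < best.get(nxt, math.inf):
                best[nxt] = cand
                heapq.heappush(queue, (cand, nxt))
    return best


def neighborhood(arcs, x, i):
    """nbh(x, i): arcs <z, y> with y != x whose conditional shortest path has i <= PSD < i + 1."""
    dist = shortest_distances(arcs, x)
    result = []
    for arc in arcs:
        z, y, d, _ = arc
        if y == x or z not in dist:
            continue  # arc points back to x, or its source is unreachable from x
        if i <= dist[z] + d < i + 1:
            result.append(arc)
    return result
```

Running Dijkstra once per node x and then bucketing arcs by the integer part of their conditional distance yields all neighborhoods of x in a single pass over the arcs.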

Selection of promising pairs of sub-sources

The number of possible sub-sources that can be identified in IS is exponential in the number of nodes of Net(IS)

We have defined a technique for singling out the most promising pairs of sub-sources

The proposed technique receives two information sources IS1 and IS2 and a Dictionary SD of Synonymies between nodes of Net(IS1) and Net(IS2)

Synonymies are represented in SD by tuples of the form <Ni, Nj, fij>, where Ni and Nj are the synonym nodes and fij is a coefficient in the real interval [0,1], indicating the similarity degree of Ni and Nj
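A small Python sketch of the Synonymy Dictionary as a list of <Ni, Nj, fij> tuples. The names `Synonymy`, `interesting_synonyms` and `similarity` are hypothetical, and returning 0.0 for a pair that SD does not list is an assumption made here for convenience, not something the talk states.

```python
from typing import NamedTuple


class Synonymy(NamedTuple):
    """One tuple <Ni, Nj, fij> of the Synonymy Dictionary SD."""
    node_i: str         # node of Net(IS1)
    node_j: str         # node of Net(IS2)
    coefficient: float  # similarity degree fij in [0, 1]


def interesting_synonyms(sd, threshold):
    """Keep the synonymies whose coefficient is strictly greater than the threshold."""
    return [s for s in sd if s.coefficient > threshold]


def similarity(sd, node_i, node_j):
    """Look up fij for a pair of nodes; 0.0 (an assumption) when SD has no tuple for it."""
    for s in sd:
        if s.node_i == node_i and s.node_j == node_j:
            return s.coefficient
    return 0.0
```

A tuple such as `Synonymy("Project[ESF]", "Project[ECP]", 0.63)` mirrors one of the synonymies used in the example of the talk.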

Selection of promising pairs of sub-sources

• The technique works according to the following rules:

– It considers those pairs of sub-sources [SSi, SSj] such that SSi ⊆ Net(IS1) is a rooted sub-net having a node Ni as root, SSj ⊆ Net(IS2) is a rooted sub-net having a node Nj as root, and Ni and Nj are interesting synonyms, i.e., the synonym coefficient associated with them is greater than a certain threshold

– It computes the maximum weight matching on suitable bipartite graphs obtained from the target nodes of the arcs included in the neighborhoods of Ni and Nj

– Given a pair of synonym nodes Ni and Nj, it derives a promising pair of sub-sources [SSik, SSjk] for each k such that both nbh(Ni,k) and nbh(Nj,k) are not empty

– SSik and SSjk are constructed by determining the promising pairs of arcs [Aik, Ajk] such that Aik ∈ nbh(Ni,l) and Ajk ∈ nbh(Nj,l), for each l belonging to the integer interval [0,k]

Selection of promising pairs of sub-sources

– A pair of arcs [Aik, Ajk] is considered promising if both of the following hold:

An edge between the target nodes Tik of Aik and Tjk of Ajk is present in the maximum weight matching computed on a suitable bipartite graph constructed from the target nodes of the arcs of nbh(Ni,l) and nbh(Nj,l), for some l belonging to the integer interval [0,k]

The similarity degree of Tik and Tjk is greater than a certain given threshold

• The rationale underlying this approach is to construct promising pairs of sub-sources such that each pair consists of the maximum possible number of pairs of concepts whose synonymy has already been established
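The arc-selection rule above can be sketched as follows. This is only an illustration under stated assumptions: the edge weights of the bipartite graph are taken to be the synonymy coefficients from SD (the slides only say the graph is "suitable"), the matching is found by exhaustive search rather than by any particular algorithm the authors may have used, and all names are hypothetical.

```python
def max_weight_matching(left, right, weight):
    """Exact maximum weight bipartite matching by exhaustive search.

    Exponential in the worst case, so adequate only for small node lists.
    Returns (total_weight, list of (l, r) pairs); zero-weight pairs are never matched.
    """
    if not left:
        return 0.0, []
    first, rest = left[0], left[1:]
    # Option 1: leave `first` unmatched.
    best_total, best_pairs = max_weight_matching(rest, right, weight)
    # Option 2: match `first` with each available right node.
    for idx, r in enumerate(right):
        w = weight(first, r)
        if w <= 0:
            continue
        total, pairs = max_weight_matching(rest, right[:idx] + right[idx + 1:], weight)
        if w + total > best_total:
            best_total, best_pairs = w + total, [(first, r)] + pairs
    return best_total, best_pairs


def promising_arc_pairs(arcs_i, arcs_j, sd, threshold):
    """Pair arcs whose target nodes are matched and similar beyond the threshold.

    arcs_i, arcs_j: lists of (source, target) arcs taken from nbh(Ni, l) and nbh(Nj, l);
    sd: dict mapping (Ti, Tj) to the coefficient fij.
    """
    targets_i = sorted({t for _, t in arcs_i})
    targets_j = sorted({t for _, t in arcs_j})
    _, matching = max_weight_matching(
        targets_i, targets_j, lambda a, b: sd.get((a, b), 0.0))
    matched = {(a, b) for a, b in matching if sd[(a, b)] > threshold}
    return [(ai, aj) for ai in arcs_i for aj in arcs_j if (ai[1], aj[1]) in matched]
```

For realistic network sizes a polynomial-time assignment algorithm would replace the exhaustive search; the selection logic on top of the matching stays the same.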

Selection of promising pairs of sub-sources

• Theorem

• Let IS1 and IS2 be two information sources and let Net(IS1) and Net(IS2) be the corresponding SDR-Networks. Let nc1 (resp., nc2) be the number of complex nodes of Net(IS1) (resp., Net(IS2)). Let l be the maximum neighborhood index associated with a node of Net(IS1) or Net(IS2). Then the number of possible pairs of sub-sources is min(nc1, nc2) × (l+1)

• In real applications, the number of promising pairs of sub-sources relative to IS1 and IS2 is generally far smaller than min(nc1, nc2) × (l+1)

Example

The SDR-Network of the European Social Funds (ESF) information source

Example

The SDR-Network of European Community Projects (ECP) information source

Example

The Synonymy Dictionary associated with ESF and ECP

Example

• The interesting pairs of synonym nodes are

{<Judicial Person[ESF], Partner[ECP], 0.59>, <Payment[ESF], Payment[ECP], 0.65>, <Project[ESF], Project[ECP], 0.63>}

• As an example, consider the pair of synonym nodes Project[ESF] and Project[ECP]

• Since the neighborhoods of Project[ESF] and Project[ECP] are both non-empty only for k=0, k=1 and k=2, our technique obtains three promising pairs of sub-sources relative to Project[ESF] and Project[ECP]

• To illustrate the behaviour of our technique, we show the derivation of the promising pair of sub-sources associated with Project[ESF] and Project[ECP] for k=0

Example

• The bipartite graph and the associated maximum weight matching are

Example

• The technique selects only those arcs of nbh(Project[ESF],0) and nbh(Project[ECP],0) which participate in the matching and have a similarity coefficient greater than a certain given threshold

• The promising pair of sub-sources associated with nbh(Project[ESF],0) and nbh(Project[ECP],0) is [SS1, SS2]:

• SS1 = {<Project[ESF], Country[ESF], [0, 1]>, <Project[ESF], Type[ESF], [0, 0.9]>, <Project[ESF], ESF_Contribution[ESF], [0, 0.75]>, <Project[ESF], Country_Share[ESF], [0, 0.9]>}

• SS2 = {<Project[ECP], Country[ECP], [0, 1]>, <Project[ECP], Type[ECP], [0, 0.6]>, <Project[ECP], ESF_Contribution[ECP], [0, 0.8]>, <Project[ECP], Country_Share[ECP], [0, 1]>}

• The technique works analogously for k=1 and k=2, as well as for the other interesting synonym pairs
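As a rough illustration, the sub-source SS1 listed above can be written down as plain Python data and its node set extracted. The tuple layout `(source, target, (d, r))` and the helper name `nodes_of` are assumptions made for this sketch.

```python
def nodes_of(sub_source):
    """Return the set of nodes touched by the arcs of a rooted sub-net."""
    nodes = set()
    for source, target, _label in sub_source:
        nodes.update((source, target))
    return nodes


# SS1 as listed in the example: four arcs leaving the root Project[ESF].
SS1 = [
    ("Project[ESF]", "Country[ESF]", (0, 1)),
    ("Project[ESF]", "Type[ESF]", (0, 0.9)),
    ("Project[ESF]", "ESF_Contribution[ESF]", (0, 0.75)),
    ("Project[ESF]", "Country_Share[ESF]", (0, 0.9)),
]

print(len(nodes_of(SS1)))  # 5: the root plus its four target nodes
```

The node sets obtained this way are what the similarity computation of the next step operates on.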

Derivation of sub-source similarities

• The technique for deriving sub-source similarities from a given pair of information sources consists of two steps:

– The first one computes the similarity degree relative to each promising pair of sub-sources derived previously

– The second one constructs a Sub-source Similarity Dictionary SSD by selecting only those pairs of sub-sources whose similarity degree is greater than a certain, dynamically computed, threshold

• More formally, the technique can be encoded as follows:

SSD = φ(ψ(SPS, SD))

• where:

• SPS is the set of promising pairs of sub-sources

• SD is the Synonymy Dictionary

Derivation of sub-source similarities

• For each promising pair of sub-sources SSi and SSj, the function ψ calls a function ψ’ which computes the corresponding similarity degree

SSS = ψ(SPS, SD) = {<SSi, SSj, ψ’(SD, ν’(SSi), ν’(SSj))> | [SSi, SSj] ∈ SPS}

• The function ν’ receives a rooted sub-net SS and returns the nodes of SS

• The function ψ’ derives the similarity degree associated with SSi and SSj by computing a suitable objective function associated with the maximum weight matching on a bipartite graph constructed from the nodes of SSi and SSj

ψ’(T, P, Q) = (1 – (|P| + |Q| – 2|E’|) / (|P| + |Q|)) × (ω’(E’) / |E’|)

• Here E’ is the set of edges of the maximum weight matching and ω’(E’) is the sum of their weights
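The objective function above can be sketched in Python as below. The function name `similarity_degree` is hypothetical; it takes the weights of the matched edges directly instead of computing the matching, and returning 0.0 for an empty matching is an assumption added to avoid a division by zero.

```python
def similarity_degree(matching_weights, size_p, size_q):
    """(1 - (|P| + |Q| - 2|E'|) / (|P| + |Q|)) * (sum of the weights of E' / |E'|).

    matching_weights: weights of the edges in the maximum weight matching E'.
    size_p, size_q: numbers of nodes |P| and |Q| of the two sub-sources.
    """
    matched = len(matching_weights)
    if matched == 0:
        return 0.0  # assumption: no matched pair means no similarity
    # Fraction of nodes covered by the matching (1.0 when every node is matched).
    coverage = 1 - (size_p + size_q - 2 * matched) / (size_p + size_q)
    # Average weight of the matched edges.
    mean_weight = sum(matching_weights) / matched
    return coverage * mean_weight
```

The first factor penalizes unmatched nodes, while the second rewards strong synonymies among the matched ones; with the figures used in the worked example of the talk, `similarity_degree([0.63, 1, 1, 1, 1], 5, 5)` rounds to 0.93.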

Derivation of sub-source similarities

• The function φ is called for constructing the Sub-source Similarity Dictionary SSD by taking those similarities of SSS having a coefficient greater than a certain, dynamically computed, threshold

SSD = φ(SSS) = {<SSi, SSj, fij> | <SSi, SSj, fij> ∈ SSS, fij > thSim}

• Here thSim is the dynamically computed threshold

thSim = min((FMax + FMin)/2, thM)

• where

• FMax is the maximum coefficient associated with the similarities of SSS

• FMin is the minimum coefficient associated with the similarities of SSS

• thM is a limit threshold value
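A short Python sketch of the thresholding step, assuming SSS is a list of `(SSi, SSj, fij)` triplets. The function names are hypothetical, and returning an empty dictionary for an empty SSS is an assumption.

```python
def similarity_threshold(coefficients, th_m):
    """thSim = min((FMax + FMin) / 2, thM)."""
    f_max, f_min = max(coefficients), min(coefficients)
    return min((f_max + f_min) / 2, th_m)


def build_ssd(sss, th_m):
    """Keep the triplets <SSi, SSj, fij> of SSS whose coefficient exceeds thSim."""
    if not sss:
        return []  # assumption: nothing to select from an empty SSS
    th_sim = similarity_threshold([f for _, _, f in sss], th_m)
    return [triplet for triplet in sss if triplet[2] > th_sim]
```

Because the comparison is strict, the limit value thM is what allows every similarity to be accepted when all coefficients are high: the midpoint alone would always exclude the pair with the minimum coefficient.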

Example

• Consider the SDR-Networks ESF and ECP

SSD = φ(ψ(SPS, SD))

• As for the pair of sub-sources [SS1, SS2] ∈ SPS derived previously, ψ calls ψ’(SD, ν’(SS1), ν’(SS2))

• The bipartite graph and the associated maximum weight matching relative to ν’(SS1) and ν’(SS2) are

Example

• The objective function associated with the maximum weight matching is

(1 – ((5+5–2×5)/10)) × ((0.63+1+1+1+1)/5) = 0.93

• In the same way, the similarity degrees associated with all the other promising pairs of sub-sources are obtained

• Then SSS is provided as input to the function φ, which constructs the Sub-source Similarity Dictionary SSD

• SSD is determined by selecting those triplets of SSS whose similarity coefficient is greater than thSim

• In this example all similarities of SSS are valid and SSD = SSS
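Spelled out step by step, the computation of the objective function in this example reads as follows, taking both sub-sources to have five nodes and the matching to have five edges, as the figures in the formula indicate:

```latex
\left(1 - \frac{5 + 5 - 2 \cdot 5}{5 + 5}\right) \times \frac{0.63 + 1 + 1 + 1 + 1}{5}
  = (1 - 0) \times \frac{4.63}{5}
  = 0.926 \approx 0.93
```

The first factor equals 1 because every node of both sub-sources takes part in the matching, so the result is simply the average weight of the matched edges.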

Applications

Sub-source similarities can be exploited in several contexts

All applications of Scheme Match relative to synonymies between single concepts can be extended to similarities between sub-sources

In particular, sub-source similarities can be exploited for:

• Information Source Integration

• E-commerce

• Semantic Query Processing

• Data and Web Warehouse

• Source clustering and cataloguing

Conclusions

We have presented a semi-automatic technique for deriving similarities between sub-sources belonging to information sources having different formats

The technique is based on a conceptual model, called SDR-Network, which allows information sources of different formats to be represented uniformly

It consists of two steps: the first one selects a set of promising pairs of sub-sources, whereas the second one computes a similarity degree to associate with each pair of the set

We have pointed out that the derivation of sub-source similarities is a special case of the more general problem of Scheme Match

Finally, we have illustrated a set of applications which could benefit from sub-source similarities

Present and Future Work

• We have already designed an approach which exploits sub-source similarities for carrying out information source integration

• In the future we plan to:

– Develop techniques which exploit sub-source similarities in the other possible application contexts we have previously mentioned

– Define techniques for deriving other terminological and structural properties in the context of semi-structured information sources

For more information...

Domenico Ursino

Dipartimento di Informatica, Matematica, Elettronica, Trasporti

Università Mediterranea di Reggio Calabria

E-mail: [email protected]

Web: http://www.ing.unirc.it/didattica/inform00/gruppo