mapping data in peer-to-peer systems:semantics and algorithmic issues department of computer science...

MAPPING DATA IN PEER-TO-PEER SYSTEMS: SEMANTICS AND ALGORITHMIC ISSUES

Anastasios Kementsietsidis, Marcelo Arenas, Renee J. Miller

Department of Computer Science, University of Toronto

Presented by Ahmet OLGUN & Suzan BAYHAN

OUTLINE

1-ABSTRACT

2-INTRODUCTION

3-MOTIVATING EXAMPLE

4-MAPPING TABLES

5-MAPPING AS CONSTRAINTS

6-CONSISTENCY AND INFERENCE

7-THE ALGORITHM

8-EXPERIMENTAL RESULTS

9-CONCLUSIONS

ABSTRACT

PROBLEM OF MAPPING DATA IN PEER-TO-PEER DATA SHARING SYSTEMS (PPDSS)

MAPPING TABLES LISTING CORRESPONDING VALUES IN A PPDSS

WHY TABLES ARE APPROPRIATE

A LANGUAGE TO SPECIFY MAPPING TABLES UNDER DIFFERENT SEMANTICS

COMPLEXITY OF THE PROBLEM

AN EFFICIENT ALGORITHM FOR ITS SOLUTION

IMPLEMENTATION WITH EXPERIMENTAL RESULTS

HYPERION PROJECT

INTRODUCTION

Traditionally, data integration and exchange between heterogeneous data sources is provided mainly through the use of views, i.e., queries

Sources share their schemas and cooperate, BUT IN OUR WORK SUCH CLOSE COOPERATION IS:

Not desirable (PRIVACY)

Not feasible (maybe due to resource limitations)

SIMILARITY WITH FILE-SHARING SYSTEMS

TO FIND DATA WHEN THERE IS NO AGREEMENT ON THE LOGICAL DESIGN OF DATA, FOCUS ON VALUES AND HOW THEY CORRESPOND

IN FILE-SHARING SYSTEMS LIKE NAPSTER AND GNUTELLA, QUERYING IS A SIMPLE VALUE SEARCH ON FILE NAMES

QUERIES ARE OF THE FORM:

“RETRIEVE ALL FILES NAMED X”

EASY BECAUSE THERE IS A CONSENSUS ON NAMES

WHAT IF THERE IS NO ACCEPTED NAMING STANDARD?

Each peer has to develop its own naming standard

Conforming to external standards is time-consuming and expensive

So, to search data in such environments, use MAPPING TABLES that store the correspondence between values.

At their simplest, mapping tables are binary tables that pair corresponding identifiers from two different sources

Mapping Tables represent EXPERT KNOWLEDGE

MOTIVATING EXAMPLE

DOMAIN:BIOLOGICAL DATABASES

* GENE DATABASE: GDB

* PROTEIN DATABASE: SwissProt

* GENETIC DISORDERS AND RELATED GENES DATABASE: MIM

EXAMPLE (CONTD)

Integration of these resources is extremely desirable, so that scientists have uniform access, BUT SEEMS UNATTAINABLE due to political, financial and technical reasons.

Among the technical reasons: heterogeneity of sources, such as formatted files, spreadsheets and relational databases

MAIN CHARACTERISTICS AND USE OF MAPPING TABLES

Associations within and across domains

Peer autonomy

Semantics

Automated discovery of mappings

Association within and Across Domains

A mapping table is not necessarily a function

With mapping tables we associate seemingly unconnected databases

Disjoint worlds can be associated, since the corresponding worlds are semantically close to each other

Peer Autonomy

Autonomy has high importance in peer-to-peer systems.

Mapping tables do not restrict the operation of peers in any way beyond the agreement on values expressed in the tables.

Mapping Table 1

GDB_id SwissProt_id

GDB:120231 P21359

GDB:120231 Q00662

GDB:120231 Q9UMK3

GDB:120232 P35240

GDB:120233 P01138

Figure 1
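
The point that a mapping table is not necessarily a function can be checked directly on the rows of Figure 1. The sketch below is only an illustration: the helper names `group_by_x` and `is_function` are hypothetical and do not come from the presentation.

```python
from collections import defaultdict

# Rows of Mapping Table 1 (Figure 1), copied from the slide.
MAPPING_TABLE_1 = [
    ("GDB:120231", "P21359"),
    ("GDB:120231", "Q00662"),
    ("GDB:120231", "Q9UMK3"),
    ("GDB:120232", "P35240"),
    ("GDB:120233", "P01138"),
]


def group_by_x(rows):
    """Collect, for every X value, the set of Y values it is paired with."""
    images = defaultdict(set)
    for x, y in rows:
        images[x].add(y)
    return dict(images)


def is_function(rows):
    """True only if every X value is paired with exactly one Y value."""
    return all(len(ys) == 1 for ys in group_by_x(rows).values())


if __name__ == "__main__":
    for gdb_id, swiss_ids in group_by_x(MAPPING_TABLE_1).items():
        print(gdb_id, sorted(swiss_ids))
    print("is a function:", is_function(MAPPING_TABLE_1))
```

GDB:120231 is paired with three SwissProt identifiers, so the table is a many-to-many association rather than a function.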

Semantics

Experts have varying degrees of expertise, so mapping tables should indicate the level of confidence in their contents

A tuple: (X, Y)

Open World: If an X value appearing in a mapping table follows the open-world semantics, then it can be associated with any Y value (partial information about X)

Closed World: If an X value follows the closed-world semantics, then it can only be associated with the Y values specified in the table.

4 alternatives

1-OO (No specific information, no practical interest)

2-OC (Partial knowledge)

3-CO (Partial knowledge)

4-CC (Complete knowledge)

Open/Closed World

                   Open-world     Closed-world
Present X value    Any Y value    Indicated Y values
Missing X value    Any Y value    No Y value

Table 1: Alternative open/closed world semantics
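
The four alternatives can be read as a small decision procedure over Table 1. The sketch below assumes that, in a name such as CO, the first letter governs X values present in the table and the second letter governs missing X values. That reading, and the function name `allowed`, are assumptions made for illustration only.

```python
def allowed(table, semantics, x, y):
    """Decide whether the pair (x, y) is admissible under Table 1.

    `semantics` is one of "OO", "OC", "CO", "CC". Assumption: the first
    letter applies to X values present in the table, the second letter to
    X values missing from it.
    """
    present_mode, missing_mode = semantics
    indicated = {ty for tx, ty in table if tx == x}
    if indicated:
        # x is present: open means any y, closed means only indicated y values.
        return True if present_mode == "O" else y in indicated
    # x is missing: open means any y, closed means no y at all.
    return missing_mode == "O"


# One row taken from Figure 1.
TABLE = [("GDB:120231", "P21359")]

if __name__ == "__main__":
    for sem in ("OO", "OC", "CO", "CC"):
        print(sem, allowed(TABLE, sem, "GDB:120231", "Q00662"),
              allowed(TABLE, sem, "GDB:120233", "P01138"))
```

Under this reading, OO admits every pair, which matches the remark that it carries no specific information.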

Automated Discovery

Given a semantics for mapping tables, we can reason about them by treating mapping tables as constraints on the exchange of information.

Simplest way to combine tables: CONJUNCTION

Example Mapping Tables

GDB_id SwissProt_id MIM_id

GDB:120231 P21359 162200

GDB:120231 Q00662 193520

GDB:120232 P35240 101000

GDB_id SwissProt_id

GDB:120231 Q00662

GDB_id MIM_id

GDB:120233 162030

MAPPING TABLES

A, B, C, D: individual attributes

dom(A): domain of A, e.g. integers, characters

U, X, Y: sets of attributes

R: a relational schema

R[U]: attributes of a schema

r: a relation instance

t: a tuple

MAPPING TABLES (contd)

t[X]: values of tuple t in the attributes of X

X = {A1, A2, ..., Ak}

dom(X) = dom(A1) × dom(A2) × ... × dom(Ak)

To represent the different semantics of mapping tables, it is necessary to introduce variables

V: a set of variables, where V ∩ dom(A) = ∅ for each attribute A

DEFINITION 1

Given a set of attributes U, t is a mapping over U if for each A ∈ U, t[A] is either a constant in dom(A), a variable in V, or an expression of the form v-S, where v ∈ V and S is a finite subset of dom(A)

DEFINITION 2

Let X and Y be nonempty disjoint sets of attributes. A mapping table m from X to Y is a finite set of mappings over X ∪ Y such that each variable appears in at most one mapping

DEFINITION 2 (contd)

A set of mappings is a "mapping table"

Tables are relations containing variables

RESTRICTION: Each variable appears in at most one mapping

TWO DIFFERENT MAPPINGS ARE COMPLETELY INDEPENDENT

DEFINITION 3

A valuation ρ over a mapping table m is a function that maps each constant value in m to itself and each variable v of m to a value in the intersection of the domains of the attributes where v appears. Furthermore, if v appears in an expression of the form v-S, then ρ(v) is not an element of S.
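
Definitions 1 to 3 can be made concrete with a small data model. The sketch below is a hypothetical encoding, not the authors' implementation: constants are plain strings, `Var` stands for a variable v or an expression v-S, and domain checks are skipped by assuming all attributes share one string domain.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A variable v, or an expression v-S when `excluded` is non-empty."""
    name: str
    excluded: frozenset = frozenset()


def find_valuation(mapping, values):
    """Return a valuation rho with rho(mapping) == values, or None.

    `mapping` and `values` are dicts keyed by attribute name. A constant
    must match exactly; a variable may take any value outside its excluded
    set, but must take the same value everywhere it occurs.
    """
    rho = {}
    for attr, entry in mapping.items():
        value = values[attr]
        if isinstance(entry, Var):
            if value in entry.excluded:
                return None
            if rho.setdefault(entry.name, value) != value:
                return None
        elif entry != value:
            return None
    return rho


def variables_are_local(table):
    """Restriction of Definition 2: each variable is in at most one mapping."""
    seen = set()
    for mapping in table:
        names = {e.name for e in mapping.values() if isinstance(e, Var)}
        if names & seen:
            return False
        seen |= names
    return True


# A two-row table using the expressions v-{G232} and v'-{P240}.
TABLE = [
    {"GDB_id": "G232", "Swiss_id": "P240"},
    {"GDB_id": Var("v", frozenset({"G232"})),
     "Swiss_id": Var("v'", frozenset({"P240"}))},
]

if __name__ == "__main__":
    print(variables_are_local(TABLE))
    print(find_valuation(TABLE[1], {"GDB_id": "G231", "Swiss_id": "P359"}))
```

A row made only of constants yields the empty valuation, since there is nothing to assign.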

MAPPING AS CONSTRAINTS

View mapping tables as constraints on the exchange of information between sources

Given a set of mapping constraints, we are able to infer new mapping constraints and check the consistency of the constraints

GDB_id   Gene_Nam
G231     NF1
G232     NF2
G233     NGFB

Swiss_id   Protn_name
P359       NF1
P240       MERL

GDB_id      Swiss_id
G232        P240
v-{G232}    v’-{P240}

GDB_id   Gene_Nam   Swiss_id   Protn_name
G231     NF1        P359       NF1
G232     NF2        P240       MERL
G233     NGFB       P359       NF1
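
The four-column relation in this example can be checked against the GDB_id/Swiss_id mapping table. The sketch assumes that a relation satisfies a mapping constraint when the (GDB_id, Swiss_id) projection of every tuple matches some row of the table under some valuation. The encoding `("var_minus", S)` for an expression v-S is a hypothetical shorthand that is valid here only because each variable occurs once.

```python
# Entries: a plain string is a constant; ("var_minus", S) stands for an
# expression v-S whose variable v occurs nowhere else in the row.
MAPPING = [
    ("G232", "P240"),
    (("var_minus", {"G232"}), ("var_minus", {"P240"})),
]

# Tuples are (GDB_id, Gene_Nam, Swiss_id, Protn_name), as on the slide.
RELATION = [
    ("G231", "NF1", "P359", "NF1"),
    ("G232", "NF2", "P240", "MERL"),
    ("G233", "NGFB", "P359", "NF1"),
]


def entry_matches(entry, value):
    """A constant must equal the value; v-S accepts any value outside S."""
    if isinstance(entry, tuple):
        return value not in entry[1]
    return entry == value


def satisfies(relation, mapping):
    """Every tuple's (GDB_id, Swiss_id) projection must match some row."""
    return all(
        any(entry_matches(ex, t[0]) and entry_matches(ey, t[2])
            for ex, ey in mapping)
        for t in relation
    )


if __name__ == "__main__":
    print(satisfies(RELATION, MAPPING))
```

Pairing G232 with any protein other than P240, or P240 with any gene other than G232, would violate the constraint, which is the effect of the second row.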

CONSISTENCY & INFERENCE

Infer new mapping tables: combine the knowledge from the mapping tables available in a network of peers

Determine consistency of mapping tables: automated inference and consistency checks help a curator see whether the semantics are valid

Problem Definition

Given a mapping constraint formula (MCF) Φ over a set of attributes U, Φ is consistent if there exists a nonempty relation r over U satisfying Φ.

The inference problem is the problem of verifying whether a set of MCFs implies another MCF

Theorems

Theorem: The consistency problem for conjunctions of mapping constraints is NP-complete.

Theorem: If the length of the paths or the number of mapping constraints is fixed, then the consistency problem for conjunctions of mapping constraints is NP-complete.

Assumptions

Assumptions to solve the consistency problem:

The number of mapping constraints per peer is small

The length of paths is small

For example, in Gnutella paths have a maximum length of 7

THE ALGORITHM

θ = P1, P2, ..., Pn: a path of peers

Ui: set of attributes at peer Pi

Σ: set of constraints over path θ

μ: X → Y: a mapping constraint

ext(μ) = {ρ(t) | t ∈ m and ρ is a valuation over m}, where m is the mapping table of μ

THE ALGORITHM

1- Σ is consistent iff there exists t ∈ ext(μ)

2- For μ’: X → Y, Σ ⊨ μ’ iff ext(μ) ⊆ ext(μ’)

For inference: check 2, i.e., whether Σ ⊨ μ’

For consistency: check 1.
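
Both checks reduce to operations on ext(μ), which can be enumerated when the domain is finite. The sketch below makes a strong simplifying assumption that is not in the presentation: all attributes share one small finite domain. The function names and the entry encoding `(var_name, excluded_set)` are likewise hypothetical.

```python
from itertools import product


def ext(table, domain):
    """All ground tuples rho(t) for rows t of `table` over a finite domain.

    An entry is a constant (str) or a pair (var_name, excluded_set).
    """
    result = set()
    for row in table:
        names = sorted({e[0] for e in row if isinstance(e, tuple)})
        for choice in product(sorted(domain), repeat=len(names)):
            rho = dict(zip(names, choice))
            # Skip valuations that assign an excluded value to a variable.
            if any(isinstance(e, tuple) and rho[e[0]] in e[1] for e in row):
                continue
            result.add(tuple(rho[e[0]] if isinstance(e, tuple) else e
                             for e in row))
    return result


def is_consistent(cover, domain):
    """Check 1: the extension contains at least one tuple."""
    return bool(ext(cover, domain))


def implies(cover, candidate, domain):
    """Check 2: ext(cover) is a subset of ext(candidate)."""
    return ext(cover, domain) <= ext(candidate, domain)


DOMAIN = {"a", "b", "c"}
COVER = [("a", "b")]
CANDIDATE = [(("v", {"c"}), ("w", set()))]

if __name__ == "__main__":
    print(is_consistent(COVER, DOMAIN))
    print(implies(COVER, CANDIDATE, DOMAIN))
```

Enumeration is exponential in the number of variables per row, so this is only suitable for toy inputs.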

Design Decisions: P1, P2, P3, P4 path

Peer P1

μ1: A1 → B1

μ2: A1,A2 → B1,B2

μ3: A3 → B2,B3

μ4: A4 → B4

μ5: A5 → B5

μ6: A6 → B6

Peer P2

μ7: B1,B4 → C1

μ8: B3 → C2

μ9: B5 → C3

Peer P3

μ10: C3 → D3

μ11: C4 → D4

Algorithm for computing the cover

P1 sends all mapping constraints to P2

P2 combines those constraints with its own to create a cover between P1 and P3

P2 forwards the cover to P3

P3 does the same to create a cover between P1 and P4

P3 sends the computed cover back to P1

Problems

Unnecessary computation

Cover involving A6 can be done locally

Does not work in streaming fashion

P1 has to wait for the whole computation to finish to get the cover between itself and P4

So ?...

Partitions

Peer P1

π1: A1 → B1; A1,A2 → B1,B2; A3 → B2,B3

π2: A4 → B4

π3: A5 → B5

π4: A6 → B6

Peer P2

π5: B1,B4 → C1

π6: B3 → C2

π7: B5 → C3

Peer P3

π8: C3 → D3

π9: C4 → D4
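
The partitions of Peer P1 can be reproduced with a union-find pass over the constraints. The sketch assumes that a partition groups the constraints of one peer that are connected, directly or through other constraints, by shared attributes. This is what the figure suggests, but the criterion and the function name `partition` are assumptions.

```python
def partition(constraints):
    """Group constraints that are linked through shared attributes.

    `constraints` is a list of (name, lhs_attributes, rhs_attributes).
    Returns a sorted list of lists of constraint names.
    """
    parent = {}

    def find(a):
        parent.setdefault(a, a)
        while parent[a] != a:
            parent[a] = parent[parent[a]]  # path halving
            a = parent[a]
        return a

    def union(a, b):
        parent[find(a)] = find(b)

    # All attributes of one constraint belong to the same component.
    for _, lhs, rhs in constraints:
        attrs = lhs + rhs
        for other in attrs[1:]:
            union(attrs[0], other)

    groups = {}
    for name, lhs, _ in constraints:
        groups.setdefault(find(lhs[0]), []).append(name)
    return sorted(groups.values())


# Constraints of Peer P1 from the Design Decisions slide.
P1 = [
    ("mu1", ["A1"], ["B1"]),
    ("mu2", ["A1", "A2"], ["B1", "B2"]),
    ("mu3", ["A3"], ["B2", "B3"]),
    ("mu4", ["A4"], ["B4"]),
    ("mu5", ["A5"], ["B5"]),
    ("mu6", ["A6"], ["B6"]),
]

if __name__ == "__main__":
    print(partition(P1))
```

The first three constraints end up together because μ1 and μ2 share A1 and B1, and μ2 and μ3 share B2.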

Description of the Algorithm

Two phases: Information gathering, Computation

Information Gathering

P1 sends to P2 the set of attributes of each partition, BUT NO MAPPINGS

P2 computes inferred partitions

Inferred partitions are used to discover interdependencies, or the lack thereof, between partitions

Then computation phase

Inferred Partitions

Peer P1            Peer P2
A1 → B1            B1,B4 → C1
A1,A2 → B1,B2      B3 → C2
A3 → B2,B3
A4 → B4
A5 → B5            B5 → C3
A6 → B6
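
The information-gathering step can be sketched using attribute sets alone, since P1 sends no mappings. The code below assumes that two partitions become interdependent when their attribute sets overlap, directly or transitively. The function name `infer_partitions` and the labels `pi1` to `pi7` are illustrative.

```python
def infer_partitions(attribute_sets):
    """Merge partitions whose attribute sets overlap, transitively.

    `attribute_sets` maps a partition name to its set of attributes.
    Returns a sorted list of sorted name lists.
    """
    merged = []  # pairwise-disjoint (names, attrs) groups built so far
    for name, attrs in attribute_sets.items():
        names, attrs = {name}, set(attrs)
        rest = []
        for other_names, other_attrs in merged:
            if attrs & other_attrs:
                names |= other_names
                attrs |= other_attrs
            else:
                rest.append((other_names, other_attrs))
        merged = rest + [(names, attrs)]
    return sorted(sorted(names) for names, _ in merged)


# Attribute sets sent by P1 (no mappings), one per partition.
SENT_BY_P1 = {
    "pi1": {"A1", "A2", "A3", "B1", "B2", "B3"},
    "pi2": {"A4", "B4"},
    "pi3": {"A5", "B5"},
    "pi4": {"A6", "B6"},
}
# Attribute sets of P2's own partitions.
LOCAL_AT_P2 = {
    "pi5": {"B1", "B4", "C1"},
    "pi6": {"B3", "C2"},
    "pi7": {"B5", "C3"},
}

if __name__ == "__main__":
    print(infer_partitions({**SENT_BY_P1, **LOCAL_AT_P2}))
```

The partition containing A6 → B6 stays isolated, which is the case the slides single out as computable locally.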

Computation Phase

The computation starts at the penultimate peer

Cover between P3 and P4 computed and sent to P2

Cover between P2 and P4 computed and streamed to P1

Cover between P1 and P4 computed
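
The backward order of the computation phase can be illustrated with a heavily simplified cover. The sketch handles binary tables of constants only and treats the cover as a join on the shared column followed by a projection. Variables, v-S expressions and open/closed semantics are ignored, and all values and names are hypothetical.

```python
def compose(left, right):
    """Cover of two ground binary tables: pairs (x, z) such that some y
    links (x, y) in `left` to (y, z) in `right`."""
    by_y = {}
    for y, z in right:
        by_y.setdefault(y, set()).add(z)
    return {(x, z) for x, y in left for z in by_y.get(y, ())}


# Hypothetical tables along a path P1, P2, P3, P4.
P1_TO_P2 = {("a1", "b1"), ("a2", "b2")}
P2_TO_P3 = {("b1", "c1"), ("b2", "c2")}
P3_TO_P4 = {("c1", "d1")}

# The computation starts at the penultimate peer and moves back to P1.
cover_p3_p4 = P3_TO_P4                         # computed at P3, sent to P2
cover_p2_p4 = compose(P2_TO_P3, cover_p3_p4)   # computed at P2, sent to P1
cover_p1_p4 = compose(P1_TO_P2, cover_p2_p4)   # computed at P1

if __name__ == "__main__":
    print(sorted(cover_p1_p4))
```

Because each peer only combines its own table with what it receives, a peer can emit results as they arrive instead of waiting for the whole path to finish.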

EXPERIMENTAL RESULTS

Do our solutions provide added value for communities that already use mapping tables extensively?

Are characteristics of our algorithm appropriate and effective in a peer-to-peer environment?

Implementation

Geographically distributed machines with one peer per machine

Each peer has 2 modules:

The first module interacts with the storage manager to retrieve mappings and compute the cover

The second implements the peer-to-peer networking protocol

Implementation (contd)

Each peer decides how much cache to use

Biology Domain: 6 biological databases used (GDB, MIM, SwissProt, Hugo, Locus, Unigene)

Table sizes range from 7000 to 28000 mappings, with an average of 13000

B2B Domain: business-to-business setting

Results

Cache sizes from 64 to 128 mappings give the best running times for data with these characteristics

B2B: complex semantics for tables, but new mappings are still computed efficiently

Total execution time scales linearly with the number of computed mappings

CONCLUSION

Problem of managing collections of mapping tables

Alternative semantics for tables

A language that allows the specification of mapping tables under different semantics

Complexity of inference and consistency

An algorithm to solve the problem

ANY QUESTIONS?

THANK YOU...
