MAPPING DATA IN PEER-TO-PEER SYSTEMS: SEMANTICS AND
ALGORITHMIC ISSUES
Anastasios Kementsietsidis & Marcelo Arenas & Renee J. Miller
Department of Computer Science, University of Toronto
presented by Ahmet OLGUN & Suzan BAYHAN
OUTLINE
1-ABSTRACT
2-INTRODUCTION
3-MOTIVATING EXAMPLE
4-MAPPING TABLES
5-MAPPING AS CONSTRAINTS
6-CONSISTENCY AND INFERENCE
7-THE ALGORITHM
8-EXPERIMENTAL RESULTS
9-CONCLUSIONS
ABSTRACT
PROBLEM OF MAPPING DATA IN PEER-TO-PEER DATA SHARING SYSTEMS (PPDSS)
MAPPING TABLES LISTING CORRESPONDING VALUES IN A PPDSS
WHY TABLES ARE APPROPRIATE
A LANGUAGE TO SPECIFY MAPPING TABLES UNDER DIFFERENT SEMANTICS
COMPLEXITY OF THE PROBLEM
AN EFFICIENT ALGORITHM FOR ITS SOLUTION
IMPLEMENTATION WITH EXPERIMENTAL RESULTS
HYPERION PROJECT
INTRODUCTION
Traditionally, data integration and exchange between heterogeneous data sources is provided mainly through the use of views, i.e., queries
Sources share their schemas and cooperate
BUT IN OUR WORK SUCH CLOSE COOPERATION IS
Not desirable (PRIVACY)
Not feasible (maybe due to resource limitations)
SIMILARITY WITH FILE-SHARING SYSTEMS
TO FIND DATA WHEN THERE IS NO AGREEMENT ON THE LOGICAL DESIGN OF DATA,
FOCUS ON VALUES AND HOW THEY CORRESPOND
IN FILE SHARING SYSTEMS LIKE NAPSTER AND GNUTELLA, QUERYING IS DONE BY SIMPLE VALUE SEARCH ON FILE NAMES
QUERIES ARE OF THE FORM:
“RETRIEVE ALL FILES NAMED X”
EASY BECAUSE THERE IS A CONSENSUS ON NAMES
WHAT IF THERE IS NO ACCEPTED NAMING STANDARD?
Each peer has to develop its own naming standard
Conforming to external standards is time-consuming and expensive
So, to search data in such environments, use MAPPING TABLES that store the correspondence between values.
At their simplest, mapping tables are binary tables relating identifiers from two different sources
Mapping Tables represent EXPERT KNOWLEDGE
MOTIVATING EXAMPLE
DOMAIN: BIOLOGICAL DATABASES
* GENE DATABASE: GDB
* PROTEIN DATABASE: SwissProt
* GENETIC DISORDERS AND RELATED GENES DATABASE: MIM
EXAMPLE (CONTD)
Integration of these resources is extremely desirable so that scientists have uniform access, BUT SEEMS UNATTAINABLE due to political, financial and technical reasons.
Among the technical reasons: heterogeneity of sources, such as formatted files, spreadsheets and relational databases
MAIN CHARACTERISTICS AND USE OF MAPPING TABLES
Associations within and Across Domains
Peer Autonomy
Semantics
Automated discovery of mappings
Associations within and Across Domains
A mapping table is not necessarily a function
By mapping tables we associate seemingly unconnected databases
Disjoint worlds can be associated since the corresponding worlds are semantically close to each other
Peer Autonomy
Autonomy has high importance in peer-to-peer systems.
Mapping tables do not restrict the operation of peers in any way beyond the agreement on values expressed in the tables.
Mapping Table 1
GDB_id SwissProt_id
GDB:120231 P21359
GDB:120231 Q00662
GDB:120231 Q9UMK3
GDB:120232 P35240
GDB:120233 P01138
Figure 1
Semantics
Experts have varying degrees of expertise, so mapping tables should indicate the confidence level of their contents
A tuple: (X,Y)
Open World: If an X value appearing in a mapping table follows the open-world semantics, then it can be associated with any Y value (partial information about X)
Closed World: If X follows the closed-world semantics, then values in the table can only be associated with the specified Y values.
4 alternatives
1-OO (No specific information, no practical interest)
2-OC (Partial knowledge)
3-CO (Partial knowledge)
4-CC (Complete knowledge)
Open/Closed World
                   Open-world    Closed-world
Present X value    Any Y value   Indicated Y values
Missing X value    Any Y value   No Y value
Table 1: Alternative open/closed world semantics
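Table 1 can be read as a membership test. The following is a minimal sketch, assuming a binary mapping table stored as a set of (x, y) pairs; the `allowed` helper and the `closed_world` flag are hypothetical names, and the logic follows the rows of Table 1 exactly as the slide states them.

```python
# Mapping Table 1 (Figure 1), stored as (GDB_id, SwissProt_id) pairs.
TABLE = {
    ("GDB:120231", "P21359"),
    ("GDB:120231", "Q00662"),
    ("GDB:120231", "Q9UMK3"),
    ("GDB:120232", "P35240"),
    ("GDB:120233", "P01138"),
}


def allowed(x, y, table, closed_world):
    """Return True if x may be associated with y (hypothetical helper).

    Reading of Table 1:
      open world   -> any Y value, for present and for missing X values
      closed world -> only the indicated Y values for a present X value,
                      and no Y value at all for a missing X value
    """
    if not closed_world:
        return True
    # Covers both closed-world rows: a missing x never appears in a pair.
    return (x, y) in table


print(allowed("GDB:120231", "P35240", TABLE, closed_world=True))   # False
print(allowed("GDB:120231", "P35240", TABLE, closed_world=False))  # True
```

Under the closed-world reading the table acts as a complete list of permitted pairs, while under the open-world reading it rules nothing out.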
Automated Discovery
Given a semantics for mapping tables, we can reason about them by treating mapping tables as constraints on the exchange of information.
Simplest way to combine tables: CONJUNCTION
Example Mapping Tables
GDB_id SwissProt_id MIM_id
GDB:120231 P21359 162200
GDB:120231 Q00662 193520
GDB:120232 P35240 101000
GDB_id SwissProt_id
GDB:120231 Q00662
GDB_id MIM_id
GDB:120233 162030
MAPPING TABLES
A,B,C,D: individual attributes
dom(A): domain of A, like integers, characters
U,X,Y: sets of attributes
R: a relational schema
R[U]: attributes of a schema
r: relation instance
t: tuples
MAPPING TABLES (contd)
t[X]: values of tuple t in the attributes of X
X = {A1, A2, ..., Ak}
dom(X) = dom(A1) × dom(A2) × ... × dom(Ak)
To represent the different semantics of mapping tables, it is necessary to introduce variables
V: a set of variables where V ∩ dom(A) = ∅ for each attribute A
DEFINITION 1
Given a set of attributes U, t is a mapping over U if for each A ∈ U, t[A] is either a constant in dom(A), a variable in V, or an expression of the form v-S, where v ∈ V and S is a finite subset of dom(A)
DEFINITION 2
Let X and Y be nonempty disjoint sets of attributes. A mapping table m from X to Y is a finite set of mappings over X ∪ Y such that each variable appears in at most one mapping
DEFINITION 2 (contd)
Set of mappings: "mapping table"
Tables: relations containing variables
RESTRICTION: Each variable appears in at most one mapping
TWO DIFFERENT MAPPINGS ARE COMPLETELY INDEPENDENT
DEFINITION 3
A valuation ρ over a mapping table m is a function that maps each constant value in m to itself and each variable v of m to a value in the intersection of the domains of the attributes where v appears. Furthermore, if v appears in an expression of the form v-S, then ρ(v) is not an element of S.
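Definitions 1 to 3 can be made concrete with a small data structure. The sketch below is an illustration under stated assumptions, not the authors' implementation: the `Var` class and the `matches` and `in_table` helpers are hypothetical names, all domains are assumed to be plain strings (so the "intersection of domains" condition is not checked), and a cell is either a constant or a variable with an optional excluded set S.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Var:
    """A variable v, optionally restricted as v - S (Definition 1)."""
    name: str
    excluded: frozenset = frozenset()


def matches(mapping, values):
    """True if some valuation rho turns `mapping` into the tuple `values`."""
    if len(mapping) != len(values):
        return False
    binding = {}  # variable name -> value chosen by rho
    for cell, value in zip(mapping, values):
        if isinstance(cell, Var):
            if value in cell.excluded:
                return False  # rho(v) must not be an element of S
            if binding.setdefault(cell.name, value) != value:
                return False  # a variable takes one value per valuation
        elif cell != value:
            return False      # constants are mapped to themselves
    return True


def in_table(table, values):
    """True if `values` can be produced from at least one mapping of the table."""
    return any(matches(mapping, values) for mapping in table)


# A table with one constant mapping and one mapping built from v-S expressions.
table = [
    ("G232", "P240"),
    (Var("v", frozenset({"G232"})), Var("w", frozenset({"P240"}))),
]
print(in_table(table, ("G231", "P359")))  # True
print(in_table(table, ("G232", "P359")))  # False
```

Because each variable appears in at most one mapping, the bindings of one mapping never need to be shared with another, which is why `matches` can start from an empty binding every time.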
MAPPING AS CONSTRAINTS
View mapping tables as constraints on the exchange of information between sources
Given a set of mapping constraints,we are able to infer new mapping constraints and check the consistency of the constraints
GDB_id  Gene_Nam
G231    NF1
G232    NF2
G233    NGFB
Swiss_id  Protn_name
P359      NF1
P240      MERL
GDB_id    Swiss_id
G232      P240
v-{G232}  v'-{P240}
GDB_id  Gene_Nam  Swiss_id  Protn_name
G231    NF1       P359      NF1
G232    NF2       P240      MERL
G233    NGFB      P359      NF1
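The example on this slide can be replayed in a few lines. This is a minimal sketch, assuming that the mapping table acts as a filter on pairs of tuples from the two sources; the cell encoding (`("var", S)` for an expression v-S) and the helper names are assumptions made for illustration.

```python
from itertools import product

gdb = [("G231", "NF1"), ("G232", "NF2"), ("G233", "NGFB")]
swiss = [("P359", "NF1"), ("P240", "MERL")]

# Each cell is a constant, or ("var", S) standing for an expression v - S.
# The two variables are distinct, so they are checked independently.
mapping_table = [
    ("G232", "P240"),
    (("var", {"G232"}), ("var", {"P240"})),
]


def cell_accepts(cell, value):
    """A constant accepts only itself; v - S accepts any value outside S."""
    if isinstance(cell, tuple):
        return value not in cell[1]
    return cell == value


def allowed_pairs(left, right, table):
    """Keep the combined tuples whose ids satisfy at least one mapping."""
    result = []
    for (gid, gname), (sid, pname) in product(left, right):
        if any(cell_accepts(x, gid) and cell_accepts(y, sid) for x, y in table):
            result.append((gid, gname, sid, pname))
    return result


for row in allowed_pairs(gdb, swiss, mapping_table):
    print(row)
# ('G231', 'NF1', 'P359', 'NF1')
# ('G232', 'NF2', 'P240', 'MERL')
# ('G233', 'NGFB', 'P359', 'NF1')
```

The first mapping pins G232 to P240, and the second lets every other gene id pair with every other protein id, which yields the three combined rows.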
CONSISTENCY & INFERENCE
Infer new mapping tables: Combine the knowledge from the mapping tables available in a network of peers
Determine consistency of mapping tables: Automated inference and consistency checks will help a curator to see whether the semantics are valid
Problem Definition
Given a mapping constraint formula (MCF) Φ over a set of attributes U, Φ is consistent if there exists a nonempty relation r over U satisfying Φ.
The inference problem is the problem of verifying whether a set of MCFs implies another MCF
Theorems
Theorem: The consistency problem for conjunctions of mapping constraints is NP-complete.
Theorem: If the length of the paths or the number of mapping constraints is fixed, then the consistency problem for conjunctions of mapping constraints is still NP-complete.
Assumptions
Assumptions to solve the consistency problem:
The number of mapping constraints per peer is small
The length of paths is small
For example, in Gnutella paths have a maximum length of 7
THE ALGORITHM
θ = P1,P2,...,Pn: a path of peers
Ui: set of attributes at each peer Pi
Σ: set of constraints over path θ
μ: X → Y a mapping constraint
ext(μ) = {ρ(t) | t ∈ m and ρ is a valuation over m}
THE ALGORITHM (contd)
1- Σ is consistent iff there exists t ∈ ext(μ)
2- For μ': X → Y, Σ ⊨ μ' iff ext(μ) ⊆ ext(μ')
For inference: check 2 to decide whether Σ ⊨ μ'
For consistency: check 1.
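Both checks reduce to questions about ext(μ): non-emptiness for consistency and containment for inference. The sketch below illustrates this under a strong simplifying assumption, namely that the attribute domains are small finite sets so that ext can be enumerated; it is not the algorithm of the paper. The cell encoding (`("var", S)` for v-S, with each variable used once) and all function names are hypothetical.

```python
from itertools import product


def accepts(cell, value):
    """A constant accepts only itself; ("var", S) accepts any value outside S."""
    if isinstance(cell, tuple):
        return value not in cell[1]
    return cell == value


def ext(table, domains):
    """Enumerate ext(mu) over finite domains (an assumption of this sketch)."""
    return {
        values
        for values in product(*domains)
        if any(all(accepts(c, v) for c, v in zip(m, values)) for m in table)
    }


def is_consistent(cover, domains):
    """Check 1: some tuple t belongs to ext(mu)."""
    return bool(ext(cover, domains))


def implies(cover, other, domains):
    """Check 2: ext(mu) is contained in ext(mu')."""
    return ext(cover, domains) <= ext(other, domains)


domains = [["G231", "G232", "G233"], ["P240", "P359"]]
cover = [("G232", "P240"), (("var", {"G232"}), ("var", {"P240"}))]
anything = [(("var", set()), ("var", set()))]  # places no restriction at all

print(is_consistent(cover, domains))                # True
print(implies(cover, anything, domains))            # True
print(implies(cover, [("G232", "P240")], domains))  # False
```

Enumeration is exponential in the number of attributes, so this only serves to make the two conditions tangible on toy inputs.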
Design Decisions: P1,P2,P3,P4 path
Peer P1
μ1: A1 → B1
μ2: A1,A2 → B1,B2
μ3: A3 → B2,B3
μ4: A4 → B4
μ5: A5 → B5
μ6: A6 → B6
Peer P2
μ7: B1,B4 → C1
μ8: B3 → C2
μ9: B5 → C3
Peer P3
μ10: C3 → D3
μ11: C4 → D4
Algorithm for computing the cover
P1 sends all mapping constraints to P2
P2 uses those constraints together with its own to create a cover between P1 and P3
P2 forwards the cover to P3
P3 does the same thing to create a cover between P1 and P4
P3 sends the computed cover back to P1
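The forwarding scheme above amounts to composing mapping tables along the path. The following minimal sketch shows only the join-and-project core, assuming binary tables that contain constants only (no variables, no multi-attribute constraints); the function names are hypothetical, and the SwissProt-to-MIM pairs are read off the three-column example table purely for illustration.

```python
from functools import reduce


def compose(m_xy, m_yz):
    """Join two constant-only binary tables on the shared middle value."""
    by_y = {}
    for y, z in m_yz:
        by_y.setdefault(y, set()).add(z)
    return {(x, z) for x, y in m_xy for z in by_y.get(y, ())}


def cover_along_path(tables):
    """Fold the tables from the first peer to the last, as in the naive scheme."""
    return reduce(compose, tables)


gdb_to_swiss = {
    ("GDB:120231", "P21359"),
    ("GDB:120231", "Q00662"),
    ("GDB:120231", "Q9UMK3"),
    ("GDB:120232", "P35240"),
    ("GDB:120233", "P01138"),
}
# Illustrative assumption: pairs taken from the GDB/SwissProt/MIM example table.
swiss_to_mim = {("P21359", "162200"), ("Q00662", "193520"), ("P35240", "101000")}

for pair in sorted(cover_along_path([gdb_to_swiss, swiss_to_mim])):
    print(pair)
# ('GDB:120231', '162200')
# ('GDB:120231', '193520')
# ('GDB:120232', '101000')
```

Each hop has to wait for the complete result of the previous one, which is the non-streaming behaviour listed under Problems.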
Problems
Unnecessary computation
Cover involving A6 can be done locally
Does not work in streaming fashion
P1 has to wait for the whole computation to finish to get the cover between itself and P4
So ?...
Partitions
Peer P1
π1: A1 → B1; A1,A2 → B1,B2; A3 → B2,B3
π2: A4 → B4
π3: A5 → B5
π4: A6 → B6
Peer P2
π5: B1,B4 → C1
π6: B3 → C2
π7: B5 → C3
Peer P3
π8: C3 → D3
π9: C4 → D4
Information Gathering
P1 sends to P2 the set of attributes of each partition BUT NO MAPPINGS
P2 computes inferred partitions
Inferred partitions are used to discover interdependencies, or the lack thereof, between partitions
Then comes the computation phase
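One plausible reading of partitions and inferred partitions is a connected-components computation over shared attributes. The sketch below assumes that constraints at a peer belong to the same partition when they share an attribute, and that inferred partitions link partitions of different peers in the same way using attribute sets only; the function name and the `mu`/`pi` labels are hypothetical.

```python
def group_by_shared_attributes(items):
    """Merge (name, attrs) items into groups connected by shared attributes."""
    groups = []  # list of (names, attrs); attribute sets stay pairwise disjoint
    for name, attrs in items:
        names, attrs = {name}, set(attrs)
        untouched = []
        for g_names, g_attrs in groups:
            if g_attrs & attrs:          # shares an attribute: merge
                names |= g_names
                attrs |= g_attrs
            else:
                untouched.append((g_names, g_attrs))
        groups = untouched + [(names, attrs)]
    return groups


# Constraints from the Design Decisions slide, as attribute lists.
peers = {
    "P1": [("mu1", "A1 B1"), ("mu2", "A1 A2 B1 B2"), ("mu3", "A3 B2 B3"),
           ("mu4", "A4 B4"), ("mu5", "A5 B5"), ("mu6", "A6 B6")],
    "P2": [("mu7", "B1 B4 C1"), ("mu8", "B3 C2"), ("mu9", "B5 C3")],
    "P3": [("mu10", "C3 D3"), ("mu11", "C4 D4")],
}

# Local partitions, computed independently at each peer.
local = []
for constraints in peers.values():
    items = [(name, attrs.split()) for name, attrs in constraints]
    local.extend(group_by_shared_attributes(items))

# Only attribute sets travel between peers, never the mappings themselves.
summaries = [(f"pi{i}", attrs) for i, (_, attrs) in enumerate(local, start=1)]
inferred = group_by_shared_attributes(summaries)

for names, _ in inferred:
    print(sorted(names))
```

On this input the sketch produces nine local partitions, and the partition holding A6 → B6 ends up in a group of its own, in line with the observation that the cover involving A6 can be computed locally.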
Computation Phase
The computation starts at the penultimate peer
Cover between P3 and P4 computed and sent to P2
Cover between P2 and P4 computed and streamed to P1
Cover between P1 and P4 computed
EXPERIMENTAL RESULTS
Do our solutions provide added value for communities that already use mapping tables extensively?
Are characteristics of our algorithm appropriate and effective in a peer-to-peer environment?
Implementation
Geographically distributed machines with one peer per machine
Each peer has 2 modules:
The first module interacts with the storage manager to retrieve mappings and compute the cover
The second is the peer-to-peer networking protocol
Implementation
Each peer decides how much cache to use
Biology Domain: 6 biological databases used: GDB, MIM, SwissProt, Hugo, Locus, Unigene
Table sizes range from 7000 to 28000 mappings, with an average of 13000.
B2B Domain: business-to-business setting
Results
Cache sizes from 64 to 128 mappings result in the best running times for those data characteristics
B2B: complex semantics for tables, but new mappings are still computed efficiently
Total execution time scales linearly with the number of computed mappings
CONCLUSION
Problem of managing collections of mapping tables
Alternative semantics for tables
A language that allows the specification of mapping tables under different semantics
Complexity of inference and consistency
An algorithm to solve the problem