Database-editing Metrics for Pattern Matching

Enrique H. Ruspini, Jerome Thomere, Michael Wolverton
Artificial Intelligence Center, SRI International
Menlo Park, California 94025
E-mail: [email protected], [email protected], [email protected]

WORKING PAPER – VERSION 3
March 25, 2003



Contents

1 Introduction
2 Abstract data representations
  2.1 Imprecision and Uncertainty in Databases
  2.2 Data Modeling Approaches
    2.2.1 Relational models
    2.2.2 Logic-based representations
    2.2.3 Frames, logical records, and similar structures
    2.2.4 Entity-Relation Models
  2.3 Entities, Relations, Values
3 Similarity
  3.1 Similarity Measures
  3.2 Similarity and Utility
  3.3 Numerical measures of admissibility
    3.3.1 Semantic Distance, Similarity, and Ontologies
    3.3.2 Similarity between leaf nodes
    3.3.3 Complex Similarity Measures
    3.3.4 Generalized Similarity Measures
4 Data Modeling
  4.1 Abstract Data Model
  4.2 Similarities between triples
5 Database editing
  5.1 Degree of matching
  5.2 Admissibility measures for basic edit operations
    5.2.1 Addition of triples
    5.2.2 Deletion of triples
    5.2.3 Replacement of triples
6 Graphs and mappings
  6.1 Admissibility of mappings and transformations
  6.2 Degree of matching


Abstract

We address the problem of determining the degree of matching between a database and a pattern that specifies a set of constraints on the nature of database objects and their relations. We assume that the databases under consideration are imprecise in the sense that the values of properties of objects and relations, or the existence of links between the former, may not be precisely known.

We discuss the transformation of databases into sets of triples (or labeled tuples) that define, albeit imprecisely, the values of properties of objects, or the potential existence of links between them. On the basis of the representation of data repositories and of patterns in this canonical form, we formulate the pattern-matching problem as the equivalent logical question of determining whether or not the database implies the logical expression defining the pattern.

Our formulation regards both databases and patterns as elastic constraints on data values that may be met to various degrees. The underlying formalism supporting this notion of imperfect matching is provided by the concept of similarity, that is, a numerical measure of resemblance between objects. We propose a method to derive semantic-based similarity measures by application of knowledge structures, such as ontologies, that characterize common properties of data objects. In order to handle the uncertainties in similarity values arising from unknown property values, this numerical measure of similarity is then generalized to an interval-valued function.

The combination of this semantic metric structure with the previously described logical constructs permits the reformulation of the pattern-matching problem as the determination of the degree to which the database matches the pattern. This multivalued-logic problem—involving notions of partial implication and equivalence—is then treated as a database-editing problem, that is, the determination of the cost of a transformation changing the database into another one that matches the pattern from the strict viewpoint of classical logic.

We define data transformations as the composition of a number of basic database edit operations, or mappings, each adding, deleting, or modifying one of the triples describing the database contents. The admissibility, or adequacy, of a transformation is defined as a function of the admissibility (or related cost) of its component editing operations. The degree of matching between database and pattern is then defined as the admissibility of the transformation with the highest admissibility value (i.e., with the least cost).

We also present graph-based representations of patterns and databases and introduce various techniques to visualize database-editing operations in terms of equivalent transformations of their graphical counterparts.


1 Introduction

In this work we consider the problem of determining the extent to which an imprecise database matches the specifications of a pattern, i.e., a set of constraints on the nature of certain data items and their relationships.

Informally, the database may be thought of as a collection of objects that are linked by various predefined relations. At a more formal level, facts describing the existence of these objects and their relationships are expressed as the conjunction of the members of a set of instantiated logical predicates such as

Person(Person-1), Event(Event-3), Participated(Person-1, Event-3), Employment(Person-1, Company-2, Position-9).

At a similar level of abstraction, patterns will also be conceived in terms of logic constructs, requiring the existence within the database of certain instances of objects and that of links, or relationships, between them. Typically, a pattern will correspond to a closed first-order-logic expression involving only the existential quantifier, such as

∃ x, y, z, . . .  Person(x) ∧ Person(y) ∧ Event(z) ∧ . . . ∧ Participated(x, y) ∧ Participated(x, z) ∧ . . .

From such a logical perspective, the pattern-matching problem may be regarded as that of finding a correspondence (variable bindings) between the variables x, y, z, . . . and selected database objects such that the resulting instantiated pattern predicates are, indeed, among the assertions contained in the database. We may also think of the specification of this correspondence between variables and objects as a constructive proof that the database implies the pattern. This perspective is, in fact, the basis for a number of logic programming languages and of logical approaches to database representation and manipulation [Col90].
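The search for such a correspondence can be illustrated by a small backtracking matcher over ground facts. This is a hedged sketch, not the paper's algorithm: the facts, the pattern, and the convention that lowercase terms are variables are all invented for the example.

```python
# Ground facts of a toy database, as instantiated predicates.
facts = {
    ("Person", "Person-1"), ("Person", "Person-2"),
    ("Event", "Event-3"),
    ("Participated", "Person-1", "Event-3"),
}

# A pattern: predicate templates whose lowercase terms are variables.
pattern = [("Person", "x"), ("Event", "z"), ("Participated", "x", "z")]

def is_var(term):
    return term[0].islower()

def match(pattern, facts, binding=None):
    """Return a variable binding that instantiates every pattern
    predicate to a fact in the database, or None if none exists."""
    binding = binding or {}
    if not pattern:
        return binding
    pred = pattern[0]
    for fact in facts:
        if len(fact) != len(pred) or fact[0] != pred[0]:
            continue
        new, ok = dict(binding), True
        for p, f in zip(pred[1:], fact[1:]):
            if is_var(p):
                if new.setdefault(p, f) != f:   # conflicting binding
                    ok = False
                    break
            elif p != f:
                ok = False
                break
        if ok:
            result = match(pattern[1:], facts, new)
            if result is not None:
                return result
    return None

print(match(pattern, facts))  # {'x': 'Person-1', 'z': 'Event-3'}
```

A successful binding is exactly the constructive proof mentioned above: instantiating the pattern with it yields only assertions already in the database.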

The databases that we will have to consider are, unlike most familiar examples of data repositories, imprecise and uncertain in the sense that either the values of properties of certain objects or the existence of relationships between them may not be precisely specified. The formal representation models underlying these databases must, for example, provide abstract and computational mechanisms to represent knowledge such as:

“The father of Person-1 is either Person-10 or Person-11,”
“The age of Person-1 is greater than 15.”

The databases that will be the object of our attention have also another unusual characteristic. In repositories of this type it is often convenient to regard the matching requirements expressed by patterns as elastic constraints that may be satisfied to various degrees. For example, a requirement to match the pattern “Person x is a Southern European who breeds attack dogs” might be matched, albeit not perfectly, by an object of the type Person who was born in Central France (which is close to or overlaps Southern Europe) and who keeps (but it is unclear whether or not he breeds) wolves. In this extended view of patterns—familiar to those concerned with retrieval of information in document databases—patterns and their related queries do not express strict requirements that are either met or not met. Rather, patterns should be regarded as procedures that rank the suitability of alternative variable-to-object assignments as potential solutions of a database-retrieval problem. Correspondingly, the values of properties of objects in databases (e.g., Southern European) and the nature of the properties themselves (e.g., breeds) should be regarded as elastic descriptions that may be met to various degrees.

It will be convenient, therefore, to think of patterns as mechanisms to measure the distance, similarity, or resemblance of potential solutions of a matching problem to a set of ideal or perfect matches. This similarity function, which reflects the semantics of the specific problem being considered, is the basis for the definition of numerical measures of degree of matching. This conceptualization suggests that the pattern-matching problem may be treated as a generalized logical program, that is, as a procedure to search a space of potential solutions and to rank their suitability [RE98, DP88], which extends the theorem-proving approaches of logic programming.

In this paper we propose an approach to find, among all possible variable bindings, the best match to a pre-specified pattern in the sense that there is no other variable binding having a strictly higher degree of matching. We will define the


degree-of-matching function in terms of the modifications, or editing operations, required to transform a database into an edited counterpart that matches the pattern. Informally, we may say that a database edit is more admissible if it results in a transformed database that is similar to the original database. Each edit operation will be associated, therefore, with a numerical value gauging its admissibility, or adequacy, that is, a measure of the extent to which the edited database remains close in meaning to the original being edited.

Our approach to database transformation, stemming from related ideas in graph editing [Wol94], is based on the characterization of complex database transformations as the composition of a sequence of several basic edit operations. The admissibility of such a sequence of transformations is defined as a function of the admissibility of the individual edits in the sequence. As there may be many such sequences leading from the original database to an edited version that matches the pattern, we will seek to determine the one having the highest admissibility (i.e., the one with the least cost). The admissibility of that sequence is then defined as the degree of matching between database and pattern.
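The choice of the best edit sequence can be sketched numerically. In this illustration the admissibility values and operation names are invented, and combining component admissibilities by product is an assumption made for the sketch; the paper requires only that the sequence's admissibility be some function of its components.

```python
from math import prod

def sequence_admissibility(edits):
    """Combine the admissibilities (in [0, 1]) of the edits in a
    sequence; the product is one simple illustrative choice, under
    which longer or more disruptive sequences score lower."""
    return prod(admissibility for _, admissibility in edits)

# Three hypothetical edit sequences, each transforming the database
# into one that matches the pattern.
candidates = [
    [("replace-link breeds->keeps", 0.7)],
    [("delete-triple", 0.4), ("add-triple", 0.9)],
    [("replace-value", 0.8), ("replace-link", 0.8)],
]

# The degree of matching is the admissibility of the best sequence,
# i.e., the maximum over all candidate transformations.
best = max(candidates, key=sequence_admissibility)
print(best, sequence_admissibility(best))
```

Here the single replacement (admissibility 0.7) beats both two-edit sequences (0.36 and 0.64), so 0.7 would be reported as the degree of matching.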

Our presentation starts in Section 2 with a discussion of the major issues that might be considered when databases are required to represent imperfect knowledge about real-world systems. Having established the key considerations that must be addressed in any significant treatment of the pattern-matching problem, we proceed to review the key elements of a number of major approaches to data modeling. We show that it is always possible to represent data models produced by application of these approaches as a set of triples that specify values of properties of objects and relations. Basic database editing operations are transformations, having different degrees of admissibility, that add, delete, or modify these triples. Since values of properties may be imprecise or uncertain, we also discuss, within this section, issues germane to the representation of these types of knowledge.

Section 3 is devoted to the introduction of the important notion of generalized similarity relations (or similarity measures), which are the basis for the characterization of functions that measure the degree of matching between database and pattern. After reviewing the basic concept of similarity and its relations to notions of utility and cost, we discuss the derivation of similarity measures that have a semantic basis in the sense that the degree of resemblance between objects reflects their shared properties as characterized by various knowledge structures (such as ontologies). The requirement to consider databases where the values of properties may not be precisely known will lead us to generalizations of these measures that reflect the inconvenient nature of imprecise knowledge.

In Section 4 we formalize the notion of data model by characterizing the important notions of objects, properties, value sets, and links. In response to requirements to represent uncertainty, we introduce two major classes of triples: those describing uncertainty about the nature of linked objects in one case, and about the values of object properties in the other.

We also stress the nature of objects as members of special object classes that, in addition to providing characterizations such as those found in ontologies, permit determination of their similarity to other objects in the class. We also introduce graph-based representations of our models, showing that, without loss of generality, the information in our data model may be represented by means of two basic types of nodes (i.e., objects and relations), which are connected by edges describing their properties.

Finally, in Section 5, we introduce the major structures required to determine the degree of matching between database and pattern. On the basis of similarity measures associated with each type of object or relation, we define the degree of admissibility or adequacy of basic editing transformations, that is, of operations that add, delete, or modify triples in the database. This definition is the basis for the characterization of the admissibility of a sequence of such edits as a function of their individual degrees of admissibility. The degree of matching between database and pattern is then defined as the maximum of the degrees of admissibility of all sequences that transform the former into the latter. We conclude the work by discussing relations between this database-editing approach and related methods for the measurement of the cost of graph-edit mappings.

2 Abstract data representations

The main objective of this section is the discussion of various data modeling approaches and their major conceptual tools. In particular, we describe how the underlying data representation structures of those approaches may be themselves modeled, at a more primitive level, as the conjunction of triples that assert that a property or attribute of an object has a certain value, or that an object has a specific relationship to another object. For example, a triple such as


(Age, U33, 25yrs),

indicates that the value of the property Age of the object U33 is 25yrs, while the triple

(Son, U33, U323),

indicates that the object U33 has the directed relationship Son to the object U323. While formally equivalent from a purely logical point of view, these two types of triples require a somewhat different computational treatment, which we discuss in Section 4.
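The two kinds of triples can be made concrete in a short sketch. The representation below, and the assumption that object identifiers start with "U", are illustrative conventions, not part of the paper's formalism.

```python
# A toy database as a set of (label, subject, object) triples.
database = {
    ("Age", "U33", "25yrs"),   # property triple: value is primitive
    ("Son", "U33", "U323"),    # relation triple: links two objects
}

def is_object_id(term):
    # Assumed identifier scheme for this sketch: names starting with 'U'.
    return isinstance(term, str) and term.startswith("U")

# Separate the two computational kinds that Section 4 treats differently:
# triples whose third component is another object vs. a primitive value.
relation_triples = {t for t in database if is_object_id(t[2])}
property_triples = database - relation_triples
print(property_triples, relation_triples)
```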

Our presentation, unlike that of conventional treatments of the subject, places particular emphasis on problems associated with the representation of imprecise and uncertain knowledge and information in the context of pattern-matching problems.

2.1 Imprecision and Uncertainty in Databases

Although a convention is yet to be reached among practitioners as to the proper usage of the terms imprecision and uncertainty, we will describe imprecision, throughout this work, as the inconvenient characteristic of information that does not allow identification of the value of an attribute or does not permit unique identification of objects having a relationship. Rather, it will be assumed that imprecise information permits identification of a set of possible attribute values or a set of objects where the actual attribute value or the related object, respectively, lies.

The following assertions exemplify imprecise knowledge:

“The age of person P-3 is at least 20 years,”
“The father of person P-99 is either person P-100 or person P-102.”

The key feature of imprecision is the inability to specify actual values of attributes or to uniquely identify objects, allowing instead only the identification of a subset of the range of a particular attribute or relation.

We will also consider possible imprecision about the nature of the links that define the property being described or the nature of the relation between two objects. It may only be known, for example, that

“Person P-46 participated in event E-3 as an accomplice,”

while better knowledge may reveal that the nature of the participation is getaway driver, i.e., a more precise characterization of the link between person and event.

One of the inconvenient features of imprecision in the context of data modeling problems is the inability to uniquely characterize objects by a combination of the values of selected properties or attributes. Since these values may not be known, it is no longer possible to rely on the notions of key, or key attribute. Rather, we will assume that objects are given a unique internal identifier, being aware of the possibility that two such computational entities may be found, after additional evidence is collected, to correspond to the same real-world object. To deal with such eventualities we will assume that a special relation, same-as, exists in some instances, linking certain objects to other (possibly multiple) objects. The intended semantics of this relation is that each of the latter objects may be identical to the former as, for example, in

“Person P-46 same-as Person P-57,”

indicating that, in the real world, the objects P-46 and P-57 describe the same person. The set of objects related to another object by the same-as relation is called the same-set and, depending on available knowledge, although non-empty,^1 may contain any number of different computational entities. We will further assume that, once imprecisions about the identity of an object have been resolved, the database will be modified by unification of the corresponding computational objects and elimination of instances of same-as links.
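The unification step described above can be sketched with a union-find structure. This is an illustrative implementation choice, not the paper's method, and the identifiers and triples are invented for the example.

```python
# Union-find sketch of resolving same-as links: unified objects end up
# sharing one representative identifier, and the same-as triples are
# then discarded, as the text prescribes.
parent = {}

def find(x):
    """Return the representative identifier of x's equivalence class."""
    parent.setdefault(x, x)
    while parent[x] != x:
        parent[x] = parent[parent[x]]  # path halving
        x = parent[x]
    return x

def unify(a, b):
    parent[find(a)] = find(b)

triples = {("same-as", "P-46", "P-57"),
           ("Age", "P-46", "25yrs"),
           ("Name", "P-57", "C. Ahab")}

# Resolve every same-as link, then rewrite the surviving triples
# under the chosen representatives.
for _, a, b in [t for t in triples if t[0] == "same-as"]:
    unify(a, b)
resolved = {(label, find(a), find(b) if b in parent else b)
            for (label, a, b) in triples if label != "same-as"}
print(resolved)  # both triples now describe the single object P-57
```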

In the context of this work, the term uncertainty will be occasionally employed to describe probabilistic knowledge about the value of an attribute or about the identity of linked objects, as exemplified by:

^1 An empty set of alternatives simply means that there is no evidence of possible duplication among internal objects and that, therefore, no link is required.


“The probability that the total-rainfall will be 50 in is 50%,”
“The probability that the father of P-90 is P-3 is 90%,”

essentially defining elements of a probability distribution in the range of a property or relation (i.e., the conditional probability that a property has a value, or the conditional probability that an object is related to another object, given the evidence).

The basic difference between imprecise and uncertain knowledge is that the former simply specifies a subset of possible values or related objects,^2 while the latter fully specifies a probability distribution over the range of the property or relation.

In this work, primarily for the sake of clarity, we will confine our attention to pattern-matching problems involving imprecise information. While the extension to uncertain information is relatively straightforward, it involves the introduction of functions defining distances and similarity between probability distributions, resulting in the consideration of issues that are not central to the problem being addressed.^3
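The distinction between the two kinds of partial knowledge can be made concrete in a short sketch (all identifiers and numbers below are invented): an imprecise value fixes only a set of possibilities, while an uncertain value assigns a full distribution over the range.

```python
# Imprecise knowledge: only a set of possible values is identified.
father_of_P99 = {"P-100", "P-102"}        # "either P-100 or P-102"

# Uncertain knowledge: a full probability distribution over the range.
father_of_P90 = {"P-3": 0.9, "P-7": 0.1}

# As footnote 2 notes, an imprecise value is a partial specification of
# a distribution: all probability mass lies inside the set, with no
# commitment about how it is divided within it.
def consistent(possible_set, distribution):
    """True if the distribution places (essentially) all of its
    probability mass inside the imprecise value's set."""
    inside = sum(p for v, p in distribution.items() if v in possible_set)
    return abs(inside - 1.0) < 1e-9

print(consistent(father_of_P99, {"P-100": 0.3, "P-102": 0.7}))  # True
print(consistent(father_of_P99, father_of_P90))                 # False
```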

2.2 Data Modeling Approaches

We will examine several data modeling approaches, identifying abstract mathematical and logical structures that capture the essential aspects of linked data structures that are germane to the pattern-matching problem. The purpose of this exercise is to show that data models constructed with these computational tools might also be represented as collections of triples that describe the values of attributes of specific database objects.^4

2.2.1 Relational models

The relational database model [Dat95] is perhaps the formulation that is most readily amenable to conceptual abstraction by means of logical and mathematical structures. In spite of its simplicity, however, some details of the underlying representation—such as the concept of key—are often the source of confusion.

The relational model is based on the representation of various objects by tables listing the values of the properties or attributes of objects. For example, a relation of the type Transaction may be represented by a table of event-instances describing the event characteristics:

Date                Paid by            Payee         Transaction Type  Amount

October 1, 1888     David Copperfield  Oliver Twist  Check             $500.00
September 23, 1786  Robinson Crusoe    V. Friday     Cash              $45.56
September 6, 1830   C. Ahab            M. Dick       Stock Transfer    $2100.00

Other tables are employed, as is well known, to describe attributes of other objects such as, in the example above, persons and transaction types. The resulting collection of inter-related tables provides the required description of various objects and entities and their relations.

Among alternative formulations, the relational model affords the best compromise between the requirements implicit in the representation of complex links between real-world entities and the need to rely on a simple abstract structure capable of being captured by labeled graphs. In particular, we may note that relational tables, beyond identifying values of properties and attributes, identify, as part of the database schema, the properties and attributes themselves (e.g., Date, Payee).

^2 This specification may be thought of as stating that the probability that the value of a related attribute or object lies on some subset is 1, i.e., a partial specification of a probability distribution.

^3 For the same reason, we will also avoid discussion of generalized schemes such as the Dempster-Shafer calculus of evidence (which relies on a combination of imprecise and uncertain representations of partial knowledge), nor will we deal, at this time, with extensions to the fuzzy domain (which is concerned with the representation of vague information).

^4 The representation of data by triples and the derivation of equivalent graph representations are well known [BH93, DG00]. The nature of the calculus of binary relations, from a historical perspective, has been discussed by Pratt [Pra92].


Furthermore, it is clear that individual entries may point to instances of other represented objects (e.g., “C. Ahab”) or to particular elements of certain primitive sets of values (e.g., a particular date).

The relational model, however, was conceived in the context of classical applications where the characteristics of entities (i.e., specific “rows” in a relational table) are uniquely identified, within the data model, by the values of a subset of the attributes (i.e., the key).^5 In applications where the attributes are known under conditions of imprecision and uncertainty, the values of attributes in the key may not be precisely known, thus preventing the identification of the relational entry (or “relationship”) by means of the functional mapping that is implicit in the key → attribute correspondence. For example, the name of a person may be only partially known, and other identification characteristics (such as numbers of identity documents) might be unknown or unreliable. Furthermore, it is possible that two different entries in a relational table correspond to the same object in the real world (e.g., a person is represented using two entries with different aliases or identity documents). Conversely, two identical imprecise entries might correspond to different objects in the real world.

To be able to refer, under such conditions, to each specific tabular entry (henceforth called a tuple or relationship) we need to be able to uniquely identify it. In our previous example, after identifying specific “person” entries in the relation Persons describing attributes of individuals (some of which may describe the same individual) and providing a unique identifier to each relationship in the relation Transaction, we have the new relation:^6

Trans-ID  Date           Paid by  Payee  Transaction Type  Amount

T-1       Oct. 1, 1888   P-5      P-9    Check             $500.00
T-2       Sep. 23, 1786  P-78     P-97   Cash              $45.56
T-3       Sep. 6, 1830   P-112    P-323  Stock Transfer    $2100.00

From this representation, we may readily describe the relationships of the data model by triples such as:

(Payee, T-2, P-97), (Amount, T-2, $45.56)

which identify how a type of link such as Payee or Amount relates a specific instance of an object (e.g., T-2) with another instance of an object (e.g., P-97), or with a value (e.g., $45.56). This is a binary relation between object-instances, or between object-instances and values, of the type identified by the name of a particular attribute or property.
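The rewriting of a keyed relational table into such triples can be sketched as a small conversion routine; the helper and the row encoding are hypothetical, with column names taken from the example above.

```python
def relation_to_triples(id_column, rows):
    """Flatten a relational table (a list of row dicts) into
    (attribute, tuple-id, value) triples, one per non-key column."""
    triples = set()
    for row in rows:
        tuple_id = row[id_column]
        for attribute, value in row.items():
            if attribute != id_column:
                triples.add((attribute, tuple_id, value))
    return triples

transaction = [
    {"Trans-ID": "T-2", "Date": "Sep. 23, 1786", "Paid by": "P-78",
     "Payee": "P-97", "Transaction Type": "Cash", "Amount": "$45.56"},
]
triples = relation_to_triples("Trans-ID", transaction)
print(("Payee", "T-2", "P-97") in triples,
      ("Amount", "T-2", "$45.56") in triples)  # True True
```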

2.2.2 Logic-based representations

Approaches based on logical representations, such as those inspired by the LISP and PROLOG programming languages [McC60, Col90], are based on symbolic representations of data, such as p(O1), q(O2, O3), r(O11, O12, . . . , O1n), interpreted as knowledge about the properties of various instances of objects.

Although there have been important extensions of these languages that permit explicit representation of semantic knowledge [BG94] (e.g., the meaning of the arguments in n-ary predicates), in general, the interpretation of symbolic expressions is usually not made clear or is hidden in the program itself.

There is a simple conceptual mapping, however, between relational entries and collections of logical representations of instantiated predicates. For example, a fact represented as r(O11, O12, . . . , O1n) may be represented in a relational setting as an entry (O11, O12, . . . , O1n) in the relation r. Since there may be multiple instances of known validity of the relation r, and since the values of the related variables may not be precisely known, it is necessary again to provide a unique system identifier permitting differentiation between two known instances

r(O11, O12, . . . , O1n) and r(O21, O22, . . . , O2n)

^5 Our discussion of the relational model and other data-modeling approaches has been simplified, focusing solely on major issues related to graph-based modeling of data structures.

^6 The identifiers are prefixed by a letter intended to identify the type of object being identified or linked. This choice was motivated by the desire to preserve some understandability of the example. In general, identifiers are introduced to uniquely identify a relationship between data entities.


of validity of the predicate r, by assigning a unique identifier rID to each instance.

From such a conceptual mapping, it is easy to see that symbolic representations such as q(O2, O3) may be alternatively represented as triples of the form

(Arg1, qID, O2), (Arg2, qID, O3)

where qID identifies the particular instantiation of the predicate q being represented using knowledge about its properties or attributes Argi.^7
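This reification of predicate instances can be sketched as follows; the identifier scheme and the helper are invented conventions for the illustration.

```python
from itertools import count

_ids = count(1)  # a simple global counter yielding unique identifiers

def reify(predicate, *args):
    """Assign a unique identifier to one known instance of an n-ary
    predicate and decompose it into one (Arg-i, id, value) triple
    per argument position."""
    instance_id = f"{predicate}ID-{next(_ids)}"
    return {(f"Arg{i}", instance_id, arg)
            for i, arg in enumerate(args, start=1)}

triples = reify("q", "O2", "O3")
print(sorted(triples))  # [('Arg1', 'qID-1', 'O2'), ('Arg2', 'qID-1', 'O3')]
```

Two distinct known instances of the same predicate receive distinct identifiers, which is exactly what distinguishes them once the facts are flattened into triples.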

2.2.3 Frames, logical records, and similar structures

Data structures that rely on representing entities by the values of their properties are conceptually similar to instances of relations.^8

Logical records describe the values and attributes of data objects while identifying related object instances and the nature of such relationships. These descriptions (i.e., fields, slots) play the same role as column headers in relational tables. From this observation, it should be clear that any such structure may be represented as a collection of triples of the type (Field, RecordID, Value), where RecordID is the identifier of a particular record (or frame), Field identifies a particular field or slot, and Value is the value of the corresponding property or attribute associated with Field.

2.2.4 Entity-Relation Models

Models based on the entity-relationship (ER) model of Chen [Che76] rely on two basic structures defining properties of real-world entities (i.e., entities) and their relations to other entities. In either case it is clear that specifications of values of properties or attributes of entities,

(Attribute, EntityID, Value),

or statements of relations between instances of entities,

(Role, EntityID1, EntityID2) ,

where Role identifies the nature of a link in an n-ary relation, provide, by means of triples, a description of entities, their links, and the nature of their links that fully captures the information described by ER structures.

2.3 Entities, Relations, Values

Our brief review of major data modeling approaches shows that their underlying representations may be mapped into an equivalent collection of triples describing the nature of a relation between a pair of objects, or into a collection of labeled instances (or relationships) of binary relations.

In the rest of this paper, we will assume that there exist two basic types of relations, which, following the nomenclature of the Entity-Relationship model of Chen, will be called entities and relations. There may be, as we saw before, multiple instances of entities and of relations (each such instance is called a relationship). We will assume, in addition, that instances of entities and relations are assigned unique identifiers that make it possible to distinguish between them as internal database objects.

For example, a database may contain an entity called Person, which may be described, in relational terms, as

Person(Name, DocumentNo, Nationality, Profession, Father, Mother),

that is, an n-ary relation defined on the Cartesian product of the set of all possible names, possible document numbers, and so on. An instance of this entity, on the other hand, corresponds to a uniquely identified element of information about the attributes of a person:

7 The semantics of both relational identifiers and that of the related variables (or arguments) is sometimes made explicit through structures that play the role of a database schema.

8 This comment reflects solely a concern with graph-based representation of basic links between data objects. Frames and similar formalisms also provide tools for the specification of complex constraints beyond the current scope of this paper.

P28 (C. Ahab, 545678900, USA, Ship Captain, P455, P321 ),

which may be, as we already saw, represented by a collection of triples such as

(Name, P28, C. Ahab) and (Father, P28, P455).

The first components of these triples are, in the case of entities, referred to as the properties or attributes of the entity-instance identified by the second component, having a value identified by the third component. From the simple examples given above, we may note that, in some cases, the value of a property or attribute (e.g., the Father of a Person) is given by the identifier of another database entity (i.e., another Person). In many other instances, however, the value of an attribute is simply a member of a special set of primitive objects, such as strings or numbers, not requiring special identification. We will say that these sets are the sets of primitive values. When the attributes of an entity-instance are uniquely identified, then such values are either primitive values (e.g., Names are Strings), entity-instances (e.g., the Father of a Person), or, in some cases, relation-instances (e.g., the relationship between a specific BirthPlace and a specific BirthDate describing properties of the Birth of a Person).

Relations such as

Transaction (Seller, Buyer, Date, TransactionType ),

and their instances such as

T65 (P34, P767, 20Aug1999, Barter),

may likewise be represented as a collection of triples such as

(Seller, T65, P34) and (Date, T65, 20Aug1999).

Often, the attributes or properties of a relation are described as roles. It should be clear from these simple examples that the values of properties, attributes, or roles of a relation-instance may be primitive values, entity-instances, or relation-instances.

Consideration of these examples also shows that modeling a relation between real-world objects as an entity or as a relationship is clearly a matter of modeling preference rather than one of conceptual or semantic necessity. Whether Person is thought of as an entity having instances characterized by values of attributes of Person instances, or as a relation between properties, is largely a matter of modeling preference that, although important from other viewpoints, is not relevant to our graph-modeling, pattern-matching, or uncertainty-representation needs. In what follows, we will therefore assume that, when data is precise, database contents may be represented in terms of triples of the type

(Link, Object, Linked-Value),

where Link is the name or label of an attribute, property, or role, Object is an instance of an entity or a relation, and Linked-Value is either an entity- or relation-instance or a specific value in a primitive value set.

It is important to note that the first component of a data triple, or link, is not instantiated, while the second component is always an instantiated entity or relation. The third component, as we have already remarked, is either an entity-instance or a value.

To allow for the representation of imprecision, we will generalize this assumption, assuming that imprecise databases may be represented as collections of triples of the form

(Link, Object, Linked-Value-Set ) ,

representing knowledge that the value of the property Link of the entity- or relation-instance Object is not known precisely but is only known to lie in the set Linked-Value-Set, which may be a subset of primitive values (e.g., names starting with “N”) or a subset of object-instances.
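As an illustration (the names are hypothetical), an imprecise triple can be represented by storing a set of candidate values in the third component; a triple is precise exactly when that set is a singleton:

```python
def imprecise_triple(link, obj, value_set):
    """Build a (Link, Object, Linked-Value-Set) triple; the value set
    records all values the property is known to lie in."""
    return (link, obj, frozenset(value_set))

def is_precise(triple):
    """Precise knowledge corresponds to a singleton value set."""
    return len(triple[2]) == 1

# The nationality of P-9 is only known to be Spanish or Italian.
t = imprecise_triple("Nationality", "P-9", {"Spanish", "Italian"})
```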

We note, in passing, that the Linked-Value-Set for instances of the distinguished relation same-as, introduced in Section 2.1, is the same-set for the linked object, i.e., a set known to contain, at least, another computational entity representing the same real-world object.


3 Similarity

The notion of similarity or resemblance is central to the approach presented in this paper. Similarity measures provide the basis for the determination of the admissibility of certain transformations of a database into another that meets the logical conditions expressed by a pattern. The notion of admissibility, which may be thought of as being dual to the concept of cost,9 is introduced to indicate that certain database modifications are more permissible than others. The basic idea is that transformations resulting in a similar database are more admissible (i.e., less costly) than those resulting in a substantial difference between the original and transformed data. The degree of admissibility of a database transformation is defined in terms of numerical measures of similarity, which are themselves the counterpart of the notion of distance. Similarity measures, mapping pairs of objects into a numeric value between 0 and 1, provide a desirable foundation for the development of a rational theory of database editing, not only because they are related to notions of cost, utility, and admissibility, but also because they provide the basis to extend classical logical relations of inference to approximate counterparts [Rus91a].

3.1 Similarity Measures

Similarity functions are measures of indistinguishability that may be thought of as being dual to the notion of bounded distance (i.e., a distance function taking values between 0 and 1). A similarity measure assigns a value between 0 and 1 to every pair of objects o and o′ in some space X. Typically, a similarity measure S may be obtained from knowledge of a distance function d by simple relations [RBP98] such as S = 1 − d. The similarity of an object o to itself is always equal to 1 (corresponding to a distance equal to zero), while the minimum possible value for a similarity function is 0.
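For instance, a similarity may be obtained from a bounded distance exactly as the relation S = 1 − d prescribes (a sketch; the toy distance on the unit interval is an assumption for illustration):

```python
def similarity_from_distance(d):
    """Build a similarity S = 1 - d from a bounded distance d
    taking values in [0, 1]."""
    return lambda o1, o2: 1.0 - d(o1, o2)

# Toy bounded distance on the unit interval: absolute difference.
S = similarity_from_distance(lambda x, y: abs(x - y))
# S(o, o) = 1, and S never leaves [0, 1].
```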

The advantage of the use of similarities in lieu of distances lies in their suitability as the foundations of logics of utility [RE98], which provide the basis to combine, on a rational basis, measures defining utility, admissibility, and cost from various perspectives. A detailed discussion of the notion of similarity is beyond the scope of this paper. We will limit ourselves to pointing out the important properties of similarity (or generalized equivalence) functions:

Reflexivity: S(x, x) = 1, for all x in X,

Symmetry: S(x, y) = S(y, x), for all x, y in X.

Transitivity: S(x, y) ≥ S(x, z) ⊗ S(z, y), for all x, y, z in X, where ⊗ is one of a family of numerical binary operators called triangular norms or T-norms.

The transitivity property is of particular importance, as it extends the transitive properties of classical deductive methods (i.e., the transitivity of the inferential procedure known as modus ponens) into multivalued logic schemes capable of modeling numeric measures of implication and compatibility [RE98].
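As a small illustration (not from the paper), a similarity of the form S = 1 − d, with d a bounded metric, is transitive under the Łukasiewicz T-norm max(a + b − 1, 0), a standard choice of triangular norm; the check below verifies the inequality on a few sample points:

```python
def t_norm(a, b):
    """Lukasiewicz triangular norm: T(a, b) = max(a + b - 1, 0)."""
    return max(a + b - 1.0, 0.0)

def S(x, y):
    """Similarity induced by the bounded metric |x - y| on [0, 1]."""
    return 1.0 - abs(x - y)

# Transitivity S(x, y) >= T(S(x, z), S(z, y)) follows from the
# triangle inequality of the underlying distance.
points = [0.0, 0.2, 0.5, 0.9, 1.0]
transitive = all(S(x, y) >= t_norm(S(x, z), S(z, y)) - 1e-12
                 for x in points for y in points for z in points)
```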

3.2 Similarity and Utility

The metric notion of similarity is closely related to utilitarian notions such as cost, admissibility, and utility [Rus91b].

Utility functions defined on a domain DomO assign to every object O of that domain a number u(O) between 0 and 1, which measures the extent to which the situation represented by that object is good or desirable.

In robotics problems, for example, the position of a mobile autonomous robot may be said to be good because it is “not in danger of hitting an obstacle,” or “because it is in communication with other robots” [SKR95]. In a control-systems or decision problem, different actions may be good or bad, from various viewpoints, depending on their outcome (i.e., the state of the system being regulated).

Utility functions provide, therefore, a convenient way to rank the desirability of events, situations, and objects on a numerical scale. As such, they have been a central concept in modern decision theory [Rai68].

9 The numerical degree of admissibility of a database transformation is the complement of the notion of cost in the sense that low costs correspond to high admissibility values while high costs correspond to low admissibility values.


If several utility functions u_1, u_2, . . . , u_n are considered when comparing pairs of objects in a domain O, then a similarity function may be defined by composition of the criteria defined for each utility [Val85]:

S(O, O′) = min_i [ |u_i(O) ⊖ u_i(O′)| ],

where ⊖ is the pseudoinverse of a triangular norm, and where |a ⊖ b| stands for min(a ⊖ b, b ⊖ a). Basically, the above result, known as Valverde's Representation Theorem, states that two objects, events, or situations are similar if, from every important viewpoint, they have similar utility values.
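A sketch of this construction with the Łukasiewicz T-norm, whose pseudoinverse (residuum) is a ⊖ b = min(1, 1 − a + b), so that |a ⊖ b| reduces to 1 − |a − b|; the utility criteria and objects below are invented for illustration:

```python
def residuum(a, b):
    """Pseudoinverse of the Lukasiewicz T-norm: min(1, 1 - a + b)."""
    return min(1.0, 1.0 - a + b)

def valverde_similarity(utilities, o1, o2):
    """S(O, O') = min over i of |u_i(O) (+) u_i(O')|, where
    |a (+) b| = min(residuum(a, b), residuum(b, a))."""
    return min(min(residuum(u(o1), u(o2)), residuum(u(o2), u(o1)))
               for u in utilities)

# Two hypothetical utility criteria for robot positions.
utilities = [lambda o: o["safety"], lambda o: o["connectivity"]]
a = {"safety": 0.9, "connectivity": 0.4}
b = {"safety": 0.8, "connectivity": 0.4}
s = valverde_similarity(utilities, a, b)   # approximately 1 - |0.9 - 0.8| = 0.9
```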

The notion of utility is also the basis for certain multivalued logics [DP88] and, more importantly, for the derivation of proof procedures that are directly applicable to problems such as pattern matching (i.e., since this problem is equivalent to proving that the data implies the pattern). Recent results indicate that these logics are closely related to the similarity-based logics underlying our approach to pattern matching [RE98].

Conversely, similarity measures may be employed to derive measures of cost and admissibility. The rest of this section is devoted to such a derivation in the context of a pattern-matching problem.

3.3 Numerical measures of admissibility

We now discuss approaches to the definition of similarity and admissibility measures in the context of pattern-matching problems.

Our simplest scheme for the numerical assessment of the admissibility of edit operations is based on similarity measures between leaves of an ontological DAG that gauge the extent to which leaf nodes share ancestors in the ontology. This numerical measure between nodes is later extended to an interval-based admissibility measure between arbitrary nodes in the ontology. This scheme, first proposed by Ruspini and Lowrance in the context of the development of SRI's SEAS system [RL02], is also similar, in spirit, to approaches that define semantic distance between concepts on the basis of the knowledge provided by a generalized thesaurus [BH01].

These ontology-based measures of similarity permit, however, gauging only the resemblance between different types of unlinked objects. Pattern-matching problems require the consideration of similarity between complex linked structures, where the similarity between two such structures depends on the nature of the links and on the attributes of the related objects. To address this problem, we discuss mechanisms for the derivation of complex similarity measures in terms of simpler constructs.

In our discussions we will, mainly for the sake of simplicity, assume that every node in an ontology is relevant to the measurement of similarity between types. In general, however, the measurement of similarity between types is made on the basis of a connected subgraph of the ontology, as many properties might be irrelevant to certain types of comparison (e.g., certain items of furniture and cows both have four legs, but this shared property is generally irrelevant to the characterization of similarity between these rather different objects).

To complete our presentation of the basic metric and utilitarian concepts supporting our approach, we show that, in general, measures of admissibility are nonsymmetric in their arguments, in the sense that, in pattern-matching problems, the degree of admissibility of, for example, replacing the link L by the link L′ is not the same as that of replacing the link L′ by the link L (e.g., replacing Animal by Cow should have a cost, as it requires assuming unavailable evidence, while replacing Cow by Animal does not entail additional assumptions). On the basis of this type of consideration, we introduce below nonsymmetric, interval-based measures of admissibility that capture the notion that any database transformation may have certain necessary and possible costs.

3.3.1 Semantic Distance, Similarity, and Ontologies

Being closely related to the notion of distance, several measures of similarity have been proposed to address problems ranging in nature from sociology and psychology to pattern recognition [SH81, BP92]. Our approach to the description of similarities between basic model objects exploits the availability of domain knowledge in the form of ontologies of various object domains and ontologies of relations.


Ontologies provide semantic bases for the definition of notions of distance, resemblance, and similarity between concepts. In most applications of knowledge-based concepts, objects belong to certain classes, or types. Ontologies, through class subsumption structures corresponding to a DAG, permit the definition of a distance function between elements of the ontology, as done by Wolverton [Wol94], on the basis of the length of the paths linking two elements of the ontology.

While this type of measure provides a simple mechanism to gauge conceptual proximity on the basis of the domain semantics provided by ontologies, the identification of a framework to measure the extent to which a representation of data objects and their relations matches a reference pattern requires the development of a more sophisticated methodology. Several considerations support this conclusion:

• The process of pattern matching is inherently nonsymmetric, since the editing operations required to transform a data pattern into a reference pattern are different from those accomplishing the inverse mapping from reference pattern to data.

For example, the admissibility of the operation replacing an object of the type Italian by another (in the pattern) of the type European (identifying a set of Nationalities) should be 1 (i.e., the associated cost should be 0), as the data is more specific than the requirement expressed by the pattern, since every possible instance of an Italian is an instance of a European. The replacement of a value node of the type European by another with a value of Italian to match a pattern requirement should, in general, have a measure of admissibility that is strictly smaller than 1, since it assumes additional, unavailable knowledge.

• In general, the admissibility of editing operations replacing a value set with another value set cannot be uniquely determined. Consider, for example, the problem of computing the costs associated with replacing the label SouthernEuropean of a value-set node (describing the set where the attribute Nationality of an instance of Person is known to lie) by the label WesternEuropean. This replacement might very well have a potential admissibility of 1 (i.e., a cost of zero), as there are some nationalities (e.g., Spanish) in the intersection of the value sets. On the other hand, the different nature and extension of the classes indicate that there may be a potential cost (corresponding to an admissibility value strictly smaller than 1) associated with that replacement (e.g., additional information may reveal that the person was Bulgarian). This range of possibilities suggests that, in many cases, the editing cost may be better represented by an interval of possible values rather than by a single number.

• The measurement scheme should reflect the extent to which different classes of objects share important common properties. Ontologies, identifying various relations of set inclusion among the classes represented by ontology types, provide a solid foundation for the measurement of resemblance on a semantic basis. Path lengths on the ontological graph, however, do not accurately reflect, in most instances, this semantic knowledge.10

We propose next a scheme to measure costs by subintervals of the interval [0, 1] of the real line representing, on the basis of knowledge provided by ontologies, the possible costs associated with basic editing operations.

3.3.2 Similarity between leaf nodes

The relations of set inclusion between nodes in an ontology may be represented in a number of ways by vectors describing whether a node of some type is the ancestor of another node. Ruspini and Lowrance [RL02] suggested a representation of the j-th node in terms of a vector whose length n is the cardinality of the ontology, with the i-th component of that vector representing whether or not the i-th node is an ancestor of the j-th node, i.e.,

v_i(N_j) = { 1, if node i is an ancestor of node j;
             0, otherwise.

On the basis of this representation, the following similarity measure may be defined:

S(N_i, N_j) = { ( ⟨v(N_i), v(N_j)⟩ − 1 ) / ( max( ‖v(N_i)‖², ‖v(N_j)‖² ) − 1 ), if either N_i or N_j is not the root node;
                1, otherwise,

10 In an ontology of Animals, for example, the distance between the classes Dog and Cat is strictly smaller than that between any particular breed of dog and any particular breed of cat.


where v(N_i) and v(N_j) are the just-introduced binary vector representations of the nodes N_i and N_j.

This similarity measure is a symmetric function taking values between 0 and 1, such that the similarity of a leaf node to itself is always equal to 1.

On the basis of this measure of similarity, it is now possible to derive interval measures that gauge the necessary and possible admissibility of graph-editing transformations.
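The leaf-node similarity above can be sketched directly from ancestor sets, which stand in for the binary vectors (the dot product is the size of the intersection of the ancestor sets, and the squared norm is the set size); the toy ontology below, and the convention that a node's ancestor set contains the node itself and the root, are assumptions for illustration:

```python
def leaf_similarity(anc_i, anc_j):
    """S(N_i, N_j) = (<v_i, v_j> - 1) / (max(|v_i|^2, |v_j|^2) - 1),
    with the subtracted 1 discounting the root, shared by all nodes."""
    dot = len(anc_i & anc_j)           # <v(N_i), v(N_j)>
    denom = max(len(anc_i), len(anc_j)) - 1
    if denom == 0:                     # both nodes are the root
        return 1.0
    return (dot - 1) / denom

# Toy ontology: Thing -> Animal -> {Dog, Cat}.
dog = {"Thing", "Animal", "Dog"}
cat = {"Thing", "Animal", "Cat"}
s = leaf_similarity(dog, cat)   # shared non-root ancestry: (2 - 1) / (3 - 1)
```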

3.3.3 Complex Similarity Measures

While ontologies permit the construction of semantics-based similarity functions between basic objects (e.g., Countries by Type of Economy) and between values of attributes (e.g., Age), it is often the case that many of the structures found in a typical pattern-matching problem, involving complex links between primitive objects, may not be amenable to this type of treatment.

Consider, for example, the set of linked objects characterizing an illegal transaction such as Money Laundering. Unless this type of object has been subject to some form of ontological characterization, it should be clear that any measure of similarity between objects of this type needs to be based on the similarity of attributes of corresponding objects in each structure.11

These complex structures may be characterized, however, through logical expressions, such as

Person(x) ∧ Person(y) ∧ Money-Transfer(z) ∧ Depositor(x, z) ∧ Payee(x, z) ∧ . . . ⇒ Money-Laundering(x, y, z, . . .),

which can be employed as the basis for the definition of a similarity measure between money-laundering events as a function of the similarities between the various basic ground predicates and the logical operators that interconnect them [Rus91c].

In general, the computation of these complex measures is straightforward. In some cases, however, as when trying to measure similarity between, say, people by the type of company they keep, application of the above approach results in a definition that depends on the very measure being defined, as the similarities between associates depend on the nature of the very people being compared (since human association is a symmetric relation and the people being compared are themselves associates of their associates). While the required measure may usually be derived by iteration,12 it is important to reduce the definition complexity, as the underlying computational problem may be intractable. In these cases, common when considering transitive relations, it may be necessary to limit the extent of the structures being compared to reduce such computations.
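A toy sketch of such an iterative (fixed-point) computation, in which two people's similarity is taken to be the similarity of their best-matching associates (all names and the update rule are illustrative assumptions, not the paper's definition):

```python
# People and their (symmetric) association lists.
associates = {"A": ["B"], "B": ["A", "C"], "C": ["B"]}
people = list(associates)

# Start from a rough guess and iterate the implicit definition
# until a fixed point is reached.
S = {(p, q): 1.0 if p == q else 0.5 for p in people for q in people}
for _ in range(100):
    S_new = {(p, q): 1.0 if p == q else
             max(S[a, b] for a in associates[p] for b in associates[q])
             for p in people for q in people}
    if S_new == S:   # fixed point: the implicit equation is satisfied
        break
    S = S_new
# A and C keep exactly the same company, so they end up maximally similar.
```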

3.3.4 Generalized Similarity Measures

In order to determine, on a sound formal basis, the admissibility of each editing operation in terms of the nature of the objects and links involved, it is necessary to derive a general metric characterizing the acceptability of the outcome of an editing operation from the viewpoint of the reference pattern being matched.

Ruspini [Rus91a, Rus91c] proposed, in the context of studies about interpretations of fuzzy-logic concepts, the measurement of the similarity between two sets by extension to a logical framework of measures, notably the well-known notion of Hausdorff distance, from the theory of metric spaces. The connection between these concepts and utilitarian interpretations of certain possibilistic constructs has recently been studied to a considerable extent [RE98].

The basic idea of this approach is to measure the degree by which one subset of a set X intersects another by the similarity of their closest pair of points, that is,

Π(A, B) = max_{o ∈ A} max_{o′ ∈ B} S(o, o′).

11 In this regard, it is important to note that ontologies, if available, already summarize the important distinctions between objects in terms of their significant attributes. The characterization of an object, such as a Person, in terms of its ontological lineage as opposed to a vector of attributes is a matter of choice. In the case of complex, domain-specific structures, however, it is reasonable to assume that ontologies will not be readily available, thus requiring a different similarity-measurement approach.

12 The definition of similarity in these cases is equivalent in character to the “fixed point” definitions of classical logic. To use another analogy, we may say that the similarity measure is being defined through an implicit equation.


When this number is one, the sets have at least one point in common. On the other hand, a small value of the possibility measure Π(A, B) indicates that all points of A are far apart from all points of B.

On the other hand, the function

I(A | B) = min_{o ∈ B} max_{o′ ∈ A} S(o, o′),

defined also over pairs of subsets of X, measures the degree of inclusion of B in A, i.e., the extent of the minimal metric neighborhood of A that encloses B (in the sense of set inclusion).

The nonsymmetric metric I will be useful in the definition of costs of basic editing operations, as it measures the extent by which the concept represented by the class A needs to be extended (or “stretched”) to encompass that described by the class B. The metric I is a measure of set inclusion, since a value of I(A | B) equal to one indicates that every member of B is also a member of A, i.e., that B is included in A. It is clear from the definitions of I and of Π that it is always

I(A | B) ≤ Π(A,B) .

On the basis of these considerations, we can now define the interval-valued admissibility metric Ad as

Ad(A | B) = [I(A | B), Π(A,B)] ,

in terms of the values of possible costs associated with the replacement of an object of the class A by an object of the class B.
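These three measures can be sketched directly from their definitions (the point sets and the toy similarity on the unit interval are illustrative assumptions):

```python
def possibility(S, A, B):
    """Pi(A, B): similarity of the closest pair of points."""
    return max(S(o, o2) for o in A for o2 in B)

def inclusion(S, A, B):
    """I(A | B): degree of inclusion of B in A."""
    return min(max(S(o2, o) for o2 in A) for o in B)

def admissibility(S, A, B):
    """Interval-valued Ad(A | B) = [I(A | B), Pi(A, B)]."""
    return (inclusion(S, A, B), possibility(S, A, B))

# Toy similarity on numbers in [0, 1].
S = lambda x, y: 1.0 - abs(x - y)
lo, hi = admissibility(S, {0.0, 0.5}, {0.5, 1.0})   # I <= Pi always holds
```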

4 Data Modeling

After presenting the basic construction blocks of our data model and reviewing the important logical, metric, and utilitarian structures that relate them, we are now in a position to give a formal definition of a general data model for the representation of imprecise data and knowledge. This model permits the representation of information such as

“There exists an European smuggler involved in EventE,”

which is imprecise since it fails to identify the smuggler and does not precisely disclose his nationality, as well as of constraints such as

“The ending time of EventE should be more than 3 days after the starting time of eventE′.”

4.1 Abstract Data Model

We start our discussion with a characterization of the various knowledge and data components of a data repository.

Definition 1: A data model is a 4-tuple M = (Obj, Vals, Link, Data),

where

1. Objects: Obj is a nonempty set, called the set of objects.

Objects correspond either to instantiated entities (e.g., a particular instance of the class or domain Person), or to a particular instance of a relation (e.g., a relationship, such as a particular instance E-346 of the relation Event).

It is often the case that objects are related by ontologies that describe both set-inclusion relations between objects and the conditions upon object properties that make possible the differentiation of subclasses (e.g., the property of Animals that makes Vertebrates different from Invertebrates). We will assume that the full set Obj of objects corresponds to a non-empty class called Thing, which, in most applications, corresponds to the root node of an ontology of objects.


2. Values: Values are individual members of any set in a collection of basic primitive types Vals, such as numbers, strings, or members of pre-defined discrete sets such as

BalkanCountries = { Albania, Bosnia, . . . , Yugoslavia }

which permit the definition of the values of properties and attributes of objects.

3. Links: Link is a nonempty set, called the set of links. We will assume that there exists a partial order ≼ defined on the set Link, with a unique link Universal such that

l ≼ Universal,

for any link l in Link.

Links correspond to various relations such as properties or attributes of entities (e.g., the Name of a Person), or to roles in relationships (e.g., the Payee in a monetary Transaction). As we have already pointed out, there is considerable latitude in data modeling as to what constitutes an entity (related to values of its attributes) and what constitutes a relation (related to other objects or relations). In the context of our model, links are different from objects in that the latter are instantiated and identified as a specific data object of a certain class (its domain), while the former are generic and are not instantiated.

The introduction of the order ≼ in Link is intended to provide a mechanism to specify whether a link L1, such as is-Financial-Participant, is more specific than another link L2, such as is-Payee.

In most applications, this order is provided by an ontology of relations having as root the link Universal, which holds between any pair of objects. In consideration of this fact, we will call all elements l of Link for which there is no different link l′ such that

l′ ≼ l,

the leaf links of Link.

4. Triples: Simple triples of the first kind have the form

(link, object_1, object_2),

where link is a leaf link, and object_1 and object_2 are objects in Obj.

Simple triples of the first kind are introduced to model classical relations between pairs of objects. Simple triples of the first kind do not allow imprecision as to the nature of the related object, nor do they provide for lack of precise knowledge about the relation being represented. The set of all simple triples of the first kind is denoted T_s^1.

Simple triples of the second kind have the form

(link, object_1, value),

where link is a leaf link, object_1 is an object in Obj, and value is a member of one of the attribute classes in Vals (e.g., a number or a string). The set of all simple triples of the second kind is denoted T_s^2. Simple triples of the second kind are introduced to model simple relations between a specific object and the value of a specific property.

We will assume that the sets T_s^1 and T_s^2 of simple triples have a metric structure defined by means of similarity functions Sim_T^1 and Sim_T^2, respectively. We discuss below an approach to the definition of such similarity measures between triples.

Complex triples of the first kind, or triples of the first kind for short, have the form

(Link, object_1, Object_2),

where Link is a link in Link, object_1 is an object in Obj, and Object_2 is a non-empty subset of objects in Obj. Complex triples of the first kind are general structures that provide the full functionality required for the representation of imprecise relationships between objects, such as the information conveyed by the following statement:

15

Page 17: Database-editing Metrics for Pattern Matchinglaw/triples3.pdf · repositories and of patterns in this canonical form, we formulate the pattern-matching problem as the equivalent logical

“Either persons P-1 or P-6 are relatives of person P-9,”

indicating lack of precise knowledge about the nature of the relation (i.e., relative versus a specific description of that relation such as son), and about who is related (i.e., either P-1 or P-6). The set of all triples of the first kind is denoted T^1.

Complex triples of the second kind, or triples of the second kind for short, have the form

(Link, object, Value-set),

where Link is a link in Link, object is an object in Obj, and Value-set is a non-empty subset of values in some set in the collection Vals of primitive types. Complex triples of the second kind are general structures that provide the full functionality required for the representation of imprecise knowledge about the values of properties or attributes of objects, as exemplified by the following statement:

“The main profession or the side-business of person P-9 is either carpenter or plumber,”

indicating lack of precise knowledge about the nature of the business activity carried out by a specific instance of an object of the type Person and ignorance about the relative importance of such activity among all the business endeavors of that person. The set of all triples of the second kind is denoted T^2.

Any triple of the first kind

(Link, object_1, Object_2)

may be associated with a set of simple triples of the first kind containing all triples of the form

(link, object_1, object_2),

such that

link ≼ Link,

object_2 ∈ Object_2.

The set of all such simple triples represents all possible values of a precise relation Link between the object object_1 and some object in the set Object_2 (i.e., if we had better information, we would be able to state precisely what the relation is and what object is related to object_1).

Similarly, any triple of the second kind

(Link, object, Value-set),

may be associated with a set of simple triples of the second kind containing all triples of the form

(link, object, value),

such that

link ≼ Link,

value ∈ Value-set.

The set of all such simple triples represents all possible values of a precise attribute link of the object object and some value in the set Value-set (i.e., if we had better information, we would be able to state precisely what the value of the corresponding attribute of the object object is).
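The association just described can be sketched by expanding a complex triple into its compatible simple triples (the toy ordering of links and all names below are hypothetical):

```python
# Maps a link to the leaf links below it in an assumed ontology of relations.
more_specific = {"Business-Activity": ["Main-Profession", "Side-Business"]}

def expand(link, obj, value_set):
    """All simple triples (leaf link, obj, value) compatible with
    the complex triple (link, obj, value_set)."""
    leaves = more_specific.get(link, [link])
    return {(leaf, obj, v) for leaf in leaves for v in value_set}

simple = expand("Business-Activity", "P-9", {"carpenter", "plumber"})
# Two candidate leaf links times two candidate values: four simple triples.
```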

The nature of the association between complex triples and sets of simple triples now permits, employing the approach outlined in Section 3.3.4, the extension of the similarity measures Sim^1_T and Sim^2_T, defined over the sets of simple triples T^1_s and T^2_s, respectively, to the interval-valued admissibility measures

Ad^1_T = [ I^1_T , Π^1_T ] ,

Ad^2_T = [ I^2_T , Π^2_T ] ,

defined over the sets of complex triples T^1 and T^2, respectively.


5. Data: The data Data is a tuple (Data1, Data2), where Data1 is a set of triples of the first kind and Data2 is a non-empty set of triples of the second kind.

This formal definition simply says that a database is a collection of triples relating objects together with a non-empty collection of triples describing (possibly imprecise) knowledge about the values and attributes of objects. We require the latter collection to be non-empty to assure that the information represented in the database is grounded on objects that are described, to some extent, by their properties.

4.2 Similarities between triples

Our approach does not place any constraints on the nature of the similarity measures Sim^1_T and Sim^2_T between simple triples leading to the definition of transformation costs by means of the admissibility functions Ad^1_T and Ad^2_T, respectively.

Among the many possibilities open for such definitions, however, there is a simple choice that defines similarity between triples as a composition of the similarities between each of the triple components. The following function, beyond providing a measure with the required semantics, is also noteworthy in that it makes the definition of similarity between objects dependent on the nature of the link being considered. For example, a link such as Age-of might prescribe that similarities between attributes taking numbers as values must compare those numbers along criteria based on their interpretation as ages (i.e., rather than as weights in pounds).

Focusing first on triples of the first kind, we will assume that there exists a similarity function SimL defined between pairs of leaf links in Link, and that for every leaf link l, there exists

1. A similarity function Sim^l_O defined between pairs of objects in Obj (i.e., a way to compare possible replacements of the second component of the tuple).

2. A similarity function Sim^l_O defined between pairs of objects in Obj (i.e., a way to compare possible replacements of the third component of the tuple).

On the basis of these similarity functions we can now define

Sim^1_T((l, o1, o2), (l′, o′1, o′2)) = SimL(l, l′) ∗ Sim^l_O(o1, o′1) ∗ Sim^l_O(o2, o′2) ,

where ∗ is a T-norm such that SimL and Sim^l_O are transitive under it.
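A minimal sketch of this compositional definition, assuming the minimum T-norm and crisp, illustrative component similarities (the link and object names are hypothetical):

```python
def sim_triple(t1, t2, sim_link, sim_obj, tnorm=min):
    """Compose a similarity between two simple triples of the first kind
    from component similarities, aggregated by a T-norm (min by default)."""
    (l1, a1, b1), (l2, a2, b2) = t1, t2
    return tnorm(sim_link(l1, l2),
                 tnorm(sim_obj(l1, a1, a2), sim_obj(l1, b1, b2)))

# Illustrative component similarities; a real system would derive these
# from the link hierarchy and object descriptions.
def sim_link(l1, l2):
    return 1.0 if l1 == l2 else 0.5

def sim_obj(link, o1, o2):   # link-dependent object comparison
    return 1.0 if o1 == o2 else 0.3

s = sim_triple(("Owns", "P-9", "House-1"),
               ("Rents", "P-9", "House-2"),
               sim_link, sim_obj)
# min(0.5, min(1.0, 0.3)) = 0.3
```

Passing `sim_obj` the link as its first argument reflects the point made above: how two objects compare may depend on the link under which they are compared.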

A similar derivation applies to triples of the second kind. Once again, we resort to a similarity function SimL defined between pairs of leaf links in Link. We will assume that, for every leaf link l, there exists

1. A similarity function Sim^l_O defined between pairs of objects in Obj (i.e., a way to compare possible replacements of the second component of the tuple).

2. A similarity function Sim^l_A defined between pairs of attribute values in some primitive value set contained in the collection Vals (i.e., a way to compare possible replacements of the third component of the tuple).

On the basis of these similarity functions we can now define

Sim^2_T((l, o, v), (l′, o′, v′)) = SimL(l, l′) ∗ Sim^l_O(o, o′) ∗ Sim^l_A(v, v′) ,

where ∗ is a T-norm such that SimL, Sim^l_O, and Sim^l_A are transitive under it.


5 Database editing

We are now in a position to propose a database-editing methodology to compute the degree of matching between a database and an instantiation of a pattern. We will consider sequences of transformations of the triples in a database that progressively transform the database into a modified, edited database that matches the pattern. Each transformation in the sequence, which may be one of

1. Deletion of triples

2. Addition of triples

3. Modification of triples

has an associated degree of admissibility.

The degree of admissibility of a sequence

S = (E1, E2, . . . , En) ,

is defined as the composition of the degrees of admissibility of the component editing transformations, i.e.,

Ad(S) = Ad(E1) ∗ Ad(E2) ∗ . . . ∗ Ad(En) ,

where the admissibility values are combined according to the following

Definition 2: Let [a, b] and [c, d] be subintervals of the interval [0, 1] of the real line. The T-norm combination of these intervals is defined by

[a, b] ∗ [c, d] = [a ∗ c, b ∗ d] .
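A sketch of Definition 2, together with the admissibility of an edit sequence built on it, assuming the product T-norm (the interval values are illustrative):

```python
def tnorm_interval(i1, i2, tnorm=lambda a, b: a * b):
    """T-norm combination of two admissibility intervals (Definition 2):
    [a, b] * [c, d] = [a * c, b * d], componentwise."""
    (a, b), (c, d) = i1, i2
    return (tnorm(a, c), tnorm(b, d))

def admissibility(sequence, tnorm=lambda a, b: a * b):
    """Admissibility of an edit sequence: the T-norm combination of the
    component intervals, starting from the neutral element [1, 1]."""
    result = (1.0, 1.0)
    for interval in sequence:
        result = tnorm_interval(result, interval, tnorm)
    return result

# Three hypothetical edits with interval-valued admissibilities.
ad = admissibility([(0.8, 1.0), (0.5, 0.9), (1.0, 1.0)])
# (0.8 * 0.5 * 1.0, 1.0 * 0.9 * 1.0) = (0.4, 0.9)
```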

5.1 Degree of matching

Several transformations, or sequences of database editing operations, may result in the transformation of a database or datamodel M into a transformed database M′. We may think of each such sequence as a path in a data-model space between the original and the transformed database. Each path accomplishing the same transformation has an associated admissibility measure that is a function of the admissibility of individual edits. From this perspective, it makes sense to measure the admissibility of a transformation in terms of the path having maximum admissibility.

Definition 3: The degree of matching between two datamodels M and M′ is the admissibility of the sequence S transforming M into M′ having maximum admissibility.13

It is important to note that, unlike classical similarity and distance metrics, the degree-of-matching function defined above will not, in general, be a symmetric function of its two arguments. The major reason for this lack of symmetry lies in the different costs associated with editing operations that are inverses of each other (e.g., the cost of adding a triple to a datamodel M is not the same as that of deleting a triple from M′).

5.2 Admissibility measures for basic edit operations

We discuss now admissibility measures for basic database-editing transformations. In our formulation, all costs (or admissibilities) depend on the nature of the triples being modified. Whenever new triples must be introduced, new objects will be added to the database at no cost other than that associated with the introduction of links in the database.

13We must note that admissibility function values, being subintervals of [0, 1], do not permit a straightforward definition of maximal admissibility. For the moment, focusing on the important notion of degree of implication, we will regard a sequence as optimal if the lower end of the corresponding interval value of the admissibility for that sequence is larger than or equal to the corresponding value for any other sequence accomplishing the same database transformation.


5.2.1 Addition of triples

The addition of triples of the form

( Link, object, Object-set ) ,

may be thought of as the replacement of the relation Universal, which always holds between pairs of objects, by the more specific relation Link. From such a perspective it is reasonable to define the admissibility of such a transformation as

Ad^1_T(T) = [ I(Link | Universal) , Π(Link, Universal) ] ,

where the functions I(. | .) and Π(., .) are derived from a similarity function between links.

Correspondingly, the addition of triples of the second kind

( Link, object, Value-set ) ,

may be thought of as the replacement of the relation Universal, which always holds between pairs of objects, by the more specific relation Link. From such a perspective it is reasonable to define the admissibility of such a transformation as

Ad^2_T(T) = [ I(Link | Universal) , Π(Link, Universal) ] ,

where the functions I(. | .) and Π(., .) are derived from a similarity function between links.

5.2.2 Deletion of triples

By the same rationale that led us to consider the addition of triples as the replacement of the universal relation Universal by a more specific relation Link, the deletion of triples may be thought of as the replacement of Link by Universal. Since

I(Universal | Link) = 1 , Π(Universal, Link) = 1 ,

it is clear that any operation of deletion of triples has an admissibility equal to [1, 1] (i.e., it does not entail any cost).

5.2.3 Replacement of triples

In consideration of our previous remarks, we will measure the cost of replacing a triple of the first kind (L, o, O) by another such triple (L′, o′, O′) by the admissibility value

Ad^1_T(T) = [ I((L′, o′, O′) | (L, o, O)) , Π((L′, o′, O′), (L, o, O)) ] .

Correspondingly, we will measure the cost of replacing a triple of the second kind (L, o, V) by another such triple (L′, o′, V′) by the admissibility value

Ad^2_T(T) = [ I((L′, o′, V′) | (L, o, V)) , Π((L′, o′, V′), (L, o, V)) ] .

6 Graphs and mappings

Graphs provide an effective way to visualize database-editing operations. We sketch below the essentials of a theory of graph-editing transformations. Although similar in many respects, this approach is not formally equivalent to the database-editing techniques proposed in this paper.

We prefer to use object or database editing as the foundation for our approach for several reasons:

1. Database editing techniques focus on links and their nature as the key element in the measurement of the degree ofmatching.


2. Graph-based techniques are limited in their ability to permit rational derivation of measures of admissibility.

3. Graph-based techniques also have representation handicaps when dealing with imprecise and uncertain databases.

4. Database-editing techniques adroitly address the issue of transforming one data model into another without needing to resort to the intermediate notion of partial graph mapping.

We sketch now the essential aspects of graph-editing methods. As was the case with our data models, we focus on the representation of triples of the form (Attribute, Object, Value), which permit expression of relational links with full generality.

Definition 4: A graph is a 4-tuple G = (V, E, µ, ν) where

(i) V is a finite set of vertices or nodes,

(ii) E, the set of edges, is a subset of V × V,

(iii) µ : V → LV is a function assigning labels to the vertices,

(iv) ν : E → LE is a function assigning labels to the edges.

We proceed now to define the concept of mapping between two graphs:

Definition 5: A graph mapping between two graphs G = (V, E, µ, ν) and G′ = (V′, E′, µ′, ν′) is a pair M = (ψ, ε) where

(i) ψ : V0 → V′0 is a one-to-one mapping between some of the vertices of the two graphs, where V0 ⊆ V and V′0 ⊆ V′,

(ii) ε : E0 → E′0 is a one-to-one mapping between some of the edges of the two graphs, where E0 ⊆ E and E′0 ⊆ E′,

such that if two edges are mapped by ε, then the nodes of those edges must be mapped by ψ; that is, if e = (v1, v2) and e′ = (v′1, v′2) are two edges such that ε(e) = e′, then ψ(v1) = v′1 and ψ(v2) = v′2.
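The endpoint-consistency condition of Definition 5 can be checked mechanically; a sketch assuming the two partial mappings are encoded as Python dictionaries:

```python
def is_graph_mapping(psi, eps):
    """Check Definition 5: psi (vertex map) and eps (edge map) must be
    one-to-one, and every pair of edges related by eps must have its
    endpoints related by psi."""
    # One-to-one: no two keys may share an image.
    if len(set(psi.values())) != len(psi) or len(set(eps.values())) != len(eps):
        return False
    # Endpoint consistency: eps((v1, v2)) = (w1, w2) forces psi(v1) = w1, psi(v2) = w2.
    for (v1, v2), (w1, w2) in eps.items():
        if psi.get(v1) != w1 or psi.get(v2) != w2:
            return False
    return True

psi = {"a": "x", "b": "y"}
eps = {("a", "b"): ("x", "y")}
# Consistent: the edge map agrees with the vertex map on both endpoints.
```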

We turn now our attention to the construction of mappings as compositions of sequences of primitive editing operations.

Graph edits and transformations We characterize next the notion of a primitive graph edit, discussing later how these primitive operations are combined.

Definition 6: An edit G → G′ is one of six types of mapping, defined below, between the graph G = (V, E, µ, ν) and another graph G′ = (V′, E′, µ′, ν′). The set of all edits is denoted by Σ.

The six allowable edit types are defined as follows:

(i) Delete Node: if v is unconnected in G, G −v→ G′ : V′ ∪ {v} = V, E = E′.

(ii) Add Node: G +v→ G′ : V ∪ {v} = V′, E = E′.

(iii) Delete Edge: G −e→ G′ : V = V′, E′ ∪ {e} = E.

(iv) Add Edge: G +e→ G′ : V = V′, E ∪ {e} = E′.

(v) Replace Node: G +v′−v→ G′ : V ∪ {v′} = V′ ∪ {v}, and E′ is E with every instance of v in an edge in E replaced with v′.

(vi) Replace Edge: G +e′−e→ G′ : V = V′, E ∪ {e′} = E′ ∪ {e}.
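Ignoring labels for brevity, the edit types above might be sketched as follows, with graphs stored as (vertices, edges) pairs; Replace Edge, which behaves like a deletion followed by an addition, is omitted:

```python
def add_node(g, v):
    V, E = g
    return V | {v}, set(E)

def delete_node(g, v):
    V, E = g
    assert all(v not in e for e in E), "only unconnected nodes may be deleted"
    return V - {v}, set(E)

def add_edge(g, e):
    V, E = g
    return set(V), E | {e}

def delete_edge(g, e):
    V, E = g
    return set(V), E - {e}

def replace_node(g, v, v_new):
    """V' = (V - {v}) | {v_new}; every occurrence of v in an edge becomes v_new."""
    V, E = g
    E2 = {(v_new if a == v else a, v_new if b == v else b) for a, b in E}
    return (V - {v}) | {v_new}, E2

g = ({"a", "b"}, {("a", "b")})
g = replace_node(g, "b", "c")
# g is now ({"a", "c"}, {("a", "c")})
```

Each function returns a fresh graph, so a transformation is simply a chain of such calls, matching the sequence-of-edits view formalized below.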


It should be clear that each of these operations defines a simple mapping M between the graphs G and G′.

Definition 7: If → is an edit operation G = (V, E, µ, ν) → G′ = (V′, E′, µ′, ν′), then G is said to be the predecessor of →, and G′ is said to be the successor of →.

A transformation is a sequence of primitive editing operations:

Definition 8: A transformation T between two graphs G and G′ is a sequence of edits (→0, . . . , →m) where

(i) G is the predecessor of →0, and G′ is the successor of →m.

(ii) If Gi is the successor of →i, then Gi is the predecessor of →i+1.

Clearly, the result of applying each of the edits in a transformation in sequence, i.e., the composition of all its primitive edits, defines a mapping between G and G′ (since each primitive edit in the transformation sequence is itself a mapping between its predecessor and its successor).

On the basis of this observation we can formally characterize a one-to-many relationship between mappings and transformations:

Definition 9: A transformation T between two graphs G and G′ is said to be consistent with a mapping M = (ψ, ε) if and only if the composition

→0 ◦ . . . ◦ →m ,

of the primitive edits in T is the mapping M.

6.1 Admissibility of mappings and transformations

In the original graph-editing formalism of Wolverton [Wol94], each edit operation is associated with a real number. This number models the cost associated with the replacement of the original graph by another in some analytical or computational process (e.g., the costs associated with the assumption that a graph G matches a pattern when only its edited counterpart G′ does).

We prefer to employ, as we did in our previous discussions, an equivalent approach based on the dual notion of admissibility, adequacy, or acceptability of a transformation, which we will assume is given by a number between 0 and 1. This notion is the dual of that of cost in the sense that

Low Cost ⇐⇒ High Admissibility

High Cost ⇐⇒ Low Admissibility

The admissibility function value (or simply the admissibility) of a particular edit typically depends on both the edit type and the particular nodes and edges being edited. The related cost function used by Wolverton [Wol94] assigns a cost to Add Node, Delete Node, Add Edge, and Delete Edge that depends only on the edit type, while the costs of Replace Node and Replace Edge depend also on the classes of the nodes being edited.

A transformation may be thought of as a path in graph space connecting a graph with its transformed version. From such a perspective, it makes sense to define the admissibility of a transformation in terms of some function that aggregates the admissibility of each of its component edits. Several functions known as triangular norms (or T-norms) [RBP98] may be employed to accomplish such aggregation:14

Definition 10: The admissibility of a transformation is the aggregation of the admissibility of its edits by means of a triangular norm ∗, i.e.,

Ad(T) = ∗_{e∈T} Ad(e) .

14T-norms may be thought of as duals of the addition operator.


If we think of a transformation as a chain linking two graphs G and G′ in graph space, then aggregation of the admissibility of individual edits by means of the minimum function corresponds to defining the strength of the chain as that of its weakest link (application of the product T-norm defines, on the other hand, the admissibility of a transformation as the product of the admissibilities of the individual edits, thus requiring that all edits have a high degree of admissibility).
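The contrast between the two aggregation choices can be seen directly; the edit admissibilities below are illustrative:

```python
from functools import reduce

edits_ad = [0.9, 0.8, 0.95]   # hypothetical admissibilities of three edits

# Weakest-link semantics: the transformation is only as admissible as its
# least admissible edit.
ad_min = reduce(min, edits_ad)                   # 0.8

# Product semantics: every additional imperfect edit further discounts the
# transformation, so all edits must be highly admissible.
ad_prod = reduce(lambda a, b: a * b, edits_ad)   # 0.9 * 0.8 * 0.95 = 0.684
```

Both min and product are T-norms, so either choice satisfies Definition 10; they differ only in how harshly they penalize accumulations of moderately admissible edits.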

6.2 Degree of matching

Several transformations may be consistent with a mapping between two graphs. As we have remarked before, we may think of each such transformation as a path between the graphs, having an associated admissibility measure that is a function of the admissibility of individual edits. From this perspective, it makes sense to measure the admissibility of the mapping between two graphs in terms of the path having maximum admissibility. We first give formal definitions for the notions of the most-admissible transformation of a mapping and of the most-admissible mapping between two graphs:

Definition 11: The most-admissible transformation of a mapping M is the transformation consistent with M that has highest admissibility. The most-admissible mapping between two graphs G and G′ is the mapping between G and G′ whose most-admissible transformation has the largest admissibility value.

Having introduced these concepts, we are now in a position to define a measure of degree of matching between graphs that is independent of either transformations or mappings.

Definition 12: The degree of matching between two graphs G and G′ is the admissibility of the most-admissible mapping between G and G′.


References

[BG94] Chitta Baral and Michael Gelfond. Logic programming and knowledge representation. Journal of Logic Programming, 19/20:73–148, 1994.

[BH93] Roland Backhouse and Paul Hoogendijk. Elements of a relational theory of datatypes. In Bernhard Moeller, Helmut Partsch, and Steve Schuman, editors, Formal Program Development, volume 755, pages 7–42. Springer-Verlag, New York, N.Y., 1993.

[BH01] Alexander Budanitsky and Graeme Hirst. Semantic distance in WordNet: An experimental, application-oriented evaluation of five measures. In Workshop on WordNet and Other Lexical Resources, Second Meeting of the North American Chapter of the Association for Computational Linguistics, Pittsburgh, PA, 2001.

[BP92] James C. Bezdek and Sankar K. Pal, editors. Fuzzy Models for Pattern Recognition: Methods that Search for Structures in Data. IEEE Press, 1992.

[Che76] P. P.-S. Chen. The entity-relationship model – toward a unified view of data. ACM Transactions on Database Systems, 1(1):9–36, 1976.

[Col90] A. Colmerauer. An introduction to Prolog III. Communications of the ACM, 33(7):69–90, 1990.

[Dat95] C. J. Date. An Introduction to Database Systems. Addison-Wesley, sixth edition, 1995.

[DG00] Daniel J. Dougherty and Claudio Gutierrez. Normal forms and reduction for theories of binary relations. In RTA, pages 95–109, 2000.

[DP88] Didier Dubois and Henri Prade. An introduction to possibilistic and fuzzy logics. In Didier Dubois, Henri Prade, and P. Smets, editors, Non-Standard Logics for Automated Reasoning. Academic Press, 1988.

[McC60] John McCarthy. LISP 1 programmer's manual. Technical report, Computation Center and Research Laboratory of Electronics, MIT, Cambridge, Mass., 1960.

[Pra92] Vaughan R. Pratt. Origins of the calculus of binary relations. In Logic in Computer Science, pages 248–254, 1992.

[Rai68] H. Raiffa. Decision Analysis. Addison-Wesley, 1968.

[RBP98] E. H. Ruspini, P. P. Bonissone, and W. Pedrycz, editors. The Handbook of Fuzzy Computation. Institute of Physics, 1998.

[RE98] E. H. Ruspini and F. Esteva. Interpretations of fuzzy sets. In E. H. Ruspini, P. P. Bonissone, and W. Pedrycz, editors, The Handbook of Fuzzy Computation. Institute of Physics, 1998.

[RL02] Enrique H. Ruspini and John Lowrance. Semantic indexing and relevance measures in SEAS. Working paper, DARPA GENOA Project, 2002.

[Rus91a] E. H. Ruspini. On the semantics of fuzzy logic. Int. J. of Approximate Reasoning, 5:45–88, 1991.

[Rus91b] E. H. Ruspini. Truth as utility: A conceptual synthesis. In Proceedings of the Conference on Uncertainty in Artificial Intelligence, pages 316–322, Los Angeles, CA, 1991.

[Rus91c] Enrique H. Ruspini. On truth and utility. In Rudolf Kruse and Pierre Siegel, editors, Symbolic and Quantitative Approaches for Uncertainty: Proceedings of the European Conference ECSQAU, Marseille, France, October 1991, pages 297–304. Springer-Verlag, Berlin, 1991.

[SH81] L. Shapiro and R. Haralick. Structural descriptions and inexact matching. IEEE Transactions on Pattern Analysis and Machine Intelligence, 3:504–519, 1981.

[SKR95] A. Saffiotti, K. Konolige, and E. H. Ruspini. A multivalued logic approach to integrating planning and control. Artificial Intelligence, 76:481–526, 1995.

[Val85] L. Valverde. On the structure of F-indistinguishability operators. Fuzzy Sets and Systems, 17(3):313–328, 1985.

[Wol94] M. Wolverton. Retrieving Semantically Distant Analogies. PhD thesis, Stanford University, 1994.
