

Page 1: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Subject Mediation for Integrated Access to Heterogeneous Information Sources

ADBIS’2001

L. A. Kalinichenko

Institute of Informatics Problems

Russian Academy of Science

Page 2: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Laboratory for compositional information systems development

Various forms of composition are studied, e.g.:

• Interoperable compositions of pre-existing components for IS design;

• Compositions of heterogeneous information collections;

• Workflow compositions;

• Type compositions in database operations over object collections;

• Heterogeneous mediators compositions.

Web site of the group: http://www.ipi.ac.ru/synthesis/

Page 3: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Talk outline

• Subject Domain Mediation

• Mediators’ Projects: Brief Overview

• Query Planning Methods

• Infrastructure of the mediator aiming at semantic interoperability of collections

• Summary

Page 4: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Subject Domain Mediation

Outline:

• Objectives of information integration

• The mediator’s concept

• Mediator’s classes

• Consolidation of a mediator

• Advantages of the subject domain mediation approach

Page 5: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Web Search Engines

1 billion Web pages.

Search engines remain the main mechanism for accessing pages, and they accept keyword queries. There are dozens of general-purpose search engines and thousands of specialized ones (regional, thematic, corporate).

The following kinds of general purpose Web search engines can be distinguished:

• basic engines: AltaVista, HotBot, Infoseek, Lycos, WebCrawler, Yahoo, Rambler, Яndex, etc.

• portals: Skworm, Proteus, Instantseek, etc.

• metasearch engines: SavvySearch, Inference Find, ProFusion, etc.

• metasearch utilities: Copernic, BeeLine, SearchPad, etc.

“Metasearch” engines send a request to several search engines and compose a combined response. The assumption is that such a response is more likely to contain relevant information.

Precision of search is very low because terms are used for indexing and search in an uncontrolled way. This is the unavoidable price of simple home page “registration” for the whole Web.

Page 6: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

What is the required level of information integration/dissemination

• Just putting information on the Web (creating a homepage, a Web site)

• Inserting a description of a resource into a suitable Digital Library (e.g., into NCSTRL, the Networked Computer Science Technical Reference Library, a collection of institutional and archival CS research reports and papers)

• Using subject gateways for easier access to networked information resources in a defined subject area. Subject gateways work as intermediaries.

• Applying a community-oriented digital library (a collection of documents built by a community of users which aims at observing or studying a phenomenon (e.g., in a context of a certain area)).

• Using heterogeneous multidatabase systems.

• Applying subject mediators to support representation and access to various subject domains. Mediators should provide modelling facilities and methods for conversion of unorganized, nonsystematic population of collections registered by different collection providers into a well-structured set of sources supported by the integrated uniform specifications. Metainformation. Systematic registration of collections.

Page 7: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

What is to be integrated, and for what purpose

What kind of information is to be supported: structured, object, semi-structured, textual, multimedia

What kind of metainformation is needed: thesauri, classifiers, vocabularies, ontologies, schema definitions (data, objects, functions, workflows)

What to disseminate:

1. A document (paper) as a whole, using an additional document description

2. A document in XML

3. Content of a document

What to retrieve:

1. To discover individual resources (Web pages, documents, papers)

2. To retrieve information relevant to a specific query contained in a collection of resources

3. To retrieve information as workflows, methods and/or data and use various compositions of those

4. To provide for interoperability of the information sources in the process of problem solving:

- technical level of interoperability

- semantic interoperability

Page 8: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Digital repositories of knowledge

Digital repositories of knowledge in certain areas can be implemented, such as Digital Earth, Digital Sky, Digital Bio, Digital Law, Digital Art, Digital Music. Examples such as Microsoft TerraServer and multi-terabyte astronomy archives are widely known.

An example: the objective of DigiTerra (an Environmental Digital Library, Rutgers) is to provide continuous land monitoring, fire detection, water and air quality testing, and urban planning, as well as to support research and instructional activities in related areas of science. The vast array of environmental data collected in DigiTerra should include images from a variety of space-borne satellites, ground data from continuous monitoring weather stations, and maps, reports and data sets from federal, state and local government agencies, and it should serve a diverse user community.

Page 9: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

The Mediator Concept

The mediator architecture (Wiederhold, 1992) deals with the problem of integration of heterogeneous information. The sources are "heterogeneous" on many levels:

• data model and types of data used;

• the underlying data units (salaries could be stored on a per-hour or per-month basis);

• behavior of objects involved;

• the underlying concepts. A payroll database may not regard a retiree as an employee, while the benefits department does. Conversely, the payroll department may include consultants in the concept "employee" while the benefits department does not;

• the schema to which the information conforms may not be rigidly fixed in advance. Examples of "semi-structured" information include that found in XML documents, repositories used in the Human Genome Project, and Lotus Notes.

A mediator provides a uniform query interface to the multiple data sources, thereby freeing the user from having to locate the relevant sources, query each one in isolation, and manually combine the information from the different sources.
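The idea of a uniform query interface over sources that differ in their data units can be sketched as follows. This is only a minimal illustration: the classes `Wrapper` and `Mediator`, the sample rows, and the constant `HOURS_PER_MONTH` are assumptions made for the example and do not come from any of the systems discussed in the talk.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

HOURS_PER_MONTH = 160  # assumption made for this sketch only


@dataclass
class Wrapper:
    """Adapts one source to the mediator's common form (monthly salary)."""
    name: str
    rows: List[Dict]                   # raw rows held by the source
    to_common: Callable[[Dict], Dict]  # per-source conversion

    def fetch(self) -> Iterable[Dict]:
        return (self.to_common(row) for row in self.rows)


class Mediator:
    def __init__(self, wrappers: List[Wrapper]):
        self.wrappers = wrappers

    def query(self, predicate: Callable[[Dict], bool]) -> List[Dict]:
        """Ask every source, convert to the common form, filter and merge."""
        answer = []
        for wrapper in self.wrappers:
            answer.extend(r for r in wrapper.fetch() if predicate(r))
        return sorted(answer, key=lambda r: r["name"])


# Hypothetical sources: one stores hourly rates, the other monthly salaries.
payroll = Wrapper(
    "payroll",
    [{"emp": "Ann", "hourly": 50}],
    lambda r: {"name": r["emp"], "monthly_salary": r["hourly"] * HOURS_PER_MONTH},
)
benefits = Wrapper(
    "benefits",
    [{"person": "Bob", "per_month": 6000}],
    lambda r: {"name": r["person"], "monthly_salary": r["per_month"]},
)

mediator = Mediator([payroll, benefits])
well_paid = mediator.query(lambda r: r["monthly_salary"] >= 7000)
```

The user states one condition in the common form; locating the sources, converting units and combining the answers happens inside the mediator.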

Page 10: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Mediation approaches

• Integration of information from pre-selected sources according to predefined information needs. This procedural approach (TSIMMIS, Squirrel, WHIPS) integrates information from sources through ad hoc procedures. When information needs or sources change, a new mediator should be generated. This is known as the Global as View (GAV) approach.

• Integration of information from arbitrary sources according to predefined information needs. This declarative approach (Carnot, SIMS, Information Manifold, Infomaster) relies on mechanisms in the mediator that rewrite queries according to source descriptions. A rewritten query should be contained in the original query. This is known as the Local as View (LAV) approach.

• Combined LAV and GAV approaches (GLAV).

Page 11: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Mediator Definition as Subject Metainformation Consolidation

To make the mediator scalable, two separate phases of its functioning are distinguished: the consolidation phase and the operational phase.

During the consolidation phase the efforts of the scientific community are focused on defining the mediator's subject by declaring its metainformation. It is assumed that top-level researchers are involved in this process. The metainformation defined at the consolidation phase is assumed to be conservative: for a certain period of time it can only be extended. Well-known, representative collections of information in the subject domain are used while the metainformation is being defined. The metainformation created at the consolidation phase constitutes the federated level of the mediator.

During the operational phase arbitrary information collections, expressed in terms of the federated level, can be registered at the mediator. Registration is autonomous and can be done by collection providers independently of each other. Users of the mediator know only the metainformation defining the mediator’s subject and formulate their queries in terms of that subject. For each query the mediator decides which registered collections are relevant.
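The two phases can be illustrated with a small sketch. Everything here is hypothetical: the class `SubjectMediator`, its methods, and the collection name `SculptureArchive` are invented for the example, and relevance is reduced to a simple overlap of class names, which is far cruder than the query planning described later in the talk.

```python
class SubjectMediator:
    def __init__(self, federated_classes):
        # Consolidation phase: the community fixes the federated-level classes.
        self.federated_classes = set(federated_classes)
        self.registry = {}  # collection name -> federated classes it covers

    def register(self, collection, covered_classes):
        """Operational phase: a provider registers a collection on its own."""
        unknown = set(covered_classes) - self.federated_classes
        if unknown:
            raise ValueError(f"not federated-level classes: {sorted(unknown)}")
        self.registry[collection] = set(covered_classes)

    def relevant_collections(self, query_classes):
        """Collections covering at least one class mentioned in the query."""
        wanted = set(query_classes)
        return sorted(name for name, cls in self.registry.items() if cls & wanted)


mediator = SubjectMediator({"Creator", "Painting", "Sculpture", "Repository"})
mediator.register("Louvre", {"Creator", "Painting", "Repository"})
mediator.register("SculptureArchive", {"Sculpture"})  # hypothetical collection

hits = mediator.relevant_collections({"Painting", "Creator"})
```

Providers call `register` independently of each other, while users only ever name federated-level classes in their queries.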

Page 12: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Subject Mediator. Cultural Heritage Collections. Federated Level Metainformation

[Class diagram: federated level metainformation for cultural heritage collections]

Person: Creator, Collector, Owner

Heritage_Entity: Painting, Sculpture, Antiquities

Repository: Museum, Gallery, Exhibition

Attributes: created_by*, date*, narrative*, identifier*, relation*, …, place_of_origin, history_period, content, origin_history, in_collection, owned_by, digital_form, ...

«type» Text: contains, near, within, follows, …

Thesauri: Cultural Heritage, History, Jurisdiction

Page 13: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Subject Mediator. Cultural Heritage Collections. Collections Registration

[Diagram: federated level metainformation, the local-into-federated level mapping, and three registered collections with their local classes]

CIMI Profile of Z39.50: museum_object (created_by*, date_collected*, description*, object_id*, relation*, …, content_general, collection, mrObject); creator_c (nationality, works)

Louvre Museum Web Site: department (name, description, sections); author (name, nationality, works)

Uffizi Museum Web Site: artist (name, biography, paint_list); canvas (title, painter, date, history, description, to_image)

Local Views in Terms of Federated Classes

creator_c(c/Creator_Creator_Info[name, nationality, date_of_birth, date_of_death, works/{set_of: Heritage_Entity_Museum_Object}]) ⊆ creator(c[name, nationality, date_of_birth, date_of_death, works])

author(a/Creator_Author[name/fname, nationality, works/{set_of: Heritage_Entity_Work}]) ⊆ creator(name, nationality, works(w)) & ∃ c, s (repository(c/Collection[contains(s/Section)]) & repository.name = ‘Louvre’ & in(w, s.contains))

artist(a/Creator_Artist[name, nationality, general_info/Text_Textual, works/{set_of: Painting_Canvas}]) ⊆ creator(a[name, nationality, general_info, works]) & repository(n/name, collection) & n = ‘Uffizi’ & ∃ col/Collection (¬isempty(intersect(collection(col/Collection).contains, works)))

Page 14: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Subject Mediator. Cultural Heritage Collections. Query Planning

[Diagram: User → Mediator (Query Planner, Thesaurus) → CIMI, Louvre, Uffizi]

User query: find digital images of Italian Renaissance paintings that depict the Madonna with a child.

Query in terms of the federated level:

{i/Image | ∃ p/Painting, d/Digital_Entity, re/Rendition (creator(nationality, works(p.digital_form(d).rendition(re).resource(i/Image))) & nationality = ‘Italy’ & p.content.contains(‘Madonna with a child’) & p.history_period = ‘Renaissance’)}

The query planner consults the thesaurus. Thesaurus extension may add ‘Virgin Mary’ and ‘Mother of God’ to the search term.

Rewritten query for CIMI:

{i/Image | ∃ o/Heritage_Entity_Museum_Object, d/Digital_Object, re/Rendition (creator_c(nationality, works(o)) & nationality = ‘Italy’ & o.history_period = ‘Renaissance’ & o.content.contains(‘Madonna with a child OR …’) & in(i, o.digital_object(d).rendition(re).resource))}

Rewritten query for Louvre:

{i/Image | ∃ w/Heritage_Entity_Work (author(nationality, works(w)) & nationality = ‘Italy’ & w.history_period = ‘Renaissance’ & w.description.contains(‘Madonna with a child OR …’) & in(i, w.to_image))}

Rewritten query for Uffizi:

{i/Image | ∃ r/Collection_Room, p/Painting_Canvas (artist(nationality, paint_list(p), room_list(r)) & in(p, r.paint_list) & nationality = ‘Italy’ & p.history_period = ‘Renaissance’ & p.description.contains(‘Madonna with a child OR …’) & in(i, p.to_image))}
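The thesaurus step of this example can be sketched in a few lines. The attribute names per source are read off the three rewritten queries; the dictionary layout and the function names `expand` and `text_condition` are assumptions made for illustration, not the mediator's actual interface.

```python
# Toy thesaurus; only one of the synonyms mentioned on the slide is listed.
THESAURUS = {"Madonna with a child": ["Virgin Mary"]}

# Text attribute searched at each source, as used in the rewritten queries.
TEXT_ATTRIBUTE = {
    "CIMI": "o.content",
    "Louvre": "w.description",
    "Uffizi": "p.description",
}


def expand(term, thesaurus):
    """Return the term followed by its thesaurus synonyms, joined with OR."""
    return " OR ".join([term] + thesaurus.get(term, []))


def text_condition(source, term, thesaurus=THESAURUS):
    """Build the source-specific text condition for one search term."""
    return f"{TEXT_ATTRIBUTE[source]}.contains('{expand(term, thesaurus)}')"


louvre_condition = text_condition("Louvre", "Madonna with a child")
```

The same federated-level condition on `p.content` thus turns into a different attribute at each source, with the search term widened by the thesaurus.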

Page 15: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Advantages of subject domain mediation

1. Subject mediation makes it possible to achieve semantic integration of heterogeneous information collections.

2. Users need to know only the subject definitions, which contain concepts, structures and methods as defined by the community.

3. Information providers can disseminate their information for integration independently of each other and at any time. To disseminate it, they register their information at the subject mediator. Users need not know anything about the registration activity.

4. The contexts of autonomous information collections, the data models and languages they use, and their implementation platforms are completely independent of the mediator and its consolidated metainformation definitions.

5. By querying the subject definitions, users have integrated access to all information registered at the mediators up to the moment of the query.

6. Mediators form a recursive structure: each mediator can be registered at another mediator. Thus, multiple subjects can be semantically integrated by defining mediators of a higher level.

7. Personalization, which provides convenient views for specific groups of users, can be built above the subject definitions. This process is independent of the existing collections and their registration.

Page 16: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Disadvantages of subject mediation

1. Providing a subject definition requires the scientific community to have reached a proper level of maturity and organization (e.g., are the research and development groups in the area sufficiently open, collaborative and motivated?). Subject consolidation is a collective, organized effort of the community.

2. The registration process is not easy and requires specific supporting tools.

Page 17: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Mediator’s Recursion

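[Diagram: a Mediator receives "Query mediator" and returns "Data from mediator"; it issues "Query collection" and receives "Data from collection"; sources join it through "Register collection" and "Register mediator (as collection)".]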
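One way to picture the recursion is to give a mediator the same query method as a collection, so that it can be registered wherever a collection can. The sketch below assumes exactly that; the class and collection names are hypothetical, and real registration would involve the view definitions discussed earlier rather than a bare list.

```python
class Collection:
    def __init__(self, name, items):
        self.name, self.items = name, list(items)

    def query(self, predicate):
        return [x for x in self.items if predicate(x)]


class Mediator:
    """Offers the same query() method as Collection, so it can be registered as one."""

    def __init__(self, name):
        self.name, self.registered = name, []

    def register(self, source):
        self.registered.append(source)  # a Collection or another Mediator

    def query(self, predicate):
        result = []
        for source in self.registered:
            result.extend(source.query(predicate))
        return result


paintings = Mediator("paintings")
paintings.register(Collection("Uffizi", ["Primavera"]))

heritage = Mediator("heritage")
heritage.register(paintings)                       # mediator registered as a collection
heritage.register(Collection("Sculptures", ["David"]))

everything = heritage.query(lambda title: True)
starting_with_p = heritage.query(lambda title: title.startswith("P"))
```

A query sent to the higher-level mediator is passed down through the registered mediator to the underlying collection without the caller noticing the extra level.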

Page 18: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Mediators’ Projects: Brief Overview

Outline:

• TSIMMIS (Stanford)

• Information Manifold (Univ. of Washington)

• GARLIC (IBM)

• InfoSleuth (MCC)

• XML as a middleware model

Page 19: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

TSIMMIS (The Stanford-IBM Manager of Multiple Information Sources)

In TSIMMIS mediators are built above a GIVEN set of sources with wrappers that export OEM self-describing objects.

OEM (Object Exchange Model) is used as a unifying data model. The mediators considered provide integrated OEM views of the underlying information (e.g., if a relational source is considered, it is exported as a set of OEM objects.)

Mediators are specified with MSL (Mediator Specification Language), a logic-based, object-oriented view definition language targeted to OEM. Variables in MSL may refer only to existing sets. In the absence of negation, MSL can be viewed as a variant of Datalog. A query consists of rules that use <object-id label value> patterns. To describe a mediator in MSL, one gives logical rules that define the OEM objects that the mediator makes available in a view.

Wrappers are specified with WSL, an extension of MSL that allows for the description of source contents and querying capabilities.
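To make the <object-id label value> structure concrete, the sketch below exports one relational row as a self-describing object with atomic sub-objects. It is a hedged illustration only: the `OEM` dataclass, the `&n` identifier format and the helper names are assumptions of this example, not the TSIMMIS implementation or MSL syntax.

```python
from dataclasses import dataclass
from itertools import count
from typing import List, Union

_ids = count(1)  # generator of fresh object identifiers (format is an assumption)


@dataclass
class OEM:
    oid: str
    label: str
    value: Union[str, int, float, List["OEM"]]  # atomic value or sub-objects


def atom(label, value):
    return OEM(f"&{next(_ids)}", label, value)


def export_row(table, row):
    """Export one relational row as a complex object with atomic children."""
    children = [atom(column, value) for column, value in row.items()]
    return OEM(f"&{next(_ids)}", table, children)


def match(obj, label):
    """Yield values of direct sub-objects with the given label (one-level pattern)."""
    if isinstance(obj.value, list):
        for child in obj.value:
            if child.label == label:
                yield child.value


employee = export_row("employee", {"name": "Ann", "salary": 8000})
names = list(match(employee, "name"))
```

Because every object carries its own label, a consumer can inspect `employee` without knowing the relational schema it came from.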

Page 20: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Information Manifold

In the Information Manifold a reasoning phase is required for realizing which sources have the data of interest, unlike TSIMMIS where view expansion is all that is needed for finding what data each source must contribute.

The user interacts with a uniform interface in the form of a set of global relations (the mediated schema) used in formulating queries. The actual data is stored in external source relations. To answer queries, a mapping between the relations in the mediated schema and the source relations must be specified. A method to specify these mappings is to describe each source relation as the result of a conjunctive query (i.e., a single Horn rule) over the relations in the mediated schema.

Given a user query formulated in terms of the relations in the mediated schema, the system must translate it to a query that mentions only the source relations and is a maximally contained plan. The collection of available data sources may not contain all the information needed to answer a query.

The Information Manifold provides uniform access to structured information sources on the WWW.

Page 21: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Source Query Capabilities Representations in Mediation Frameworks

Sources express their capabilities in mediation systems through a variety of mechanisms - query templates, capability records, and simple capability-description grammars.

Data sources with different and limited query capabilities are accessed either by writing rich functional wrappers for the more primitive sources or by dealing with all sources at a ''lowest common denominator''. In another approach, a mediator ensures that sources receive only queries they can handle, while still taking advantage of all the query power of each source.

Wrappers reflect the actual query capabilities of the underlying data sources, while the mediator has a general mechanism for interpreting those capabilities and forming execution strategies for queries. Capabilities-Based Rewriters (CBR) are the basic mechanisms by which mediators develop a plan for a query, taking into account the capabilities of the sources.

Page 22: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

The GARLIC Approach (IBM Almaden)

Heterogeneous and multimedia information systems are the main targets.

Multimedia systems typically support only specific data types: for example, document retrieval through various kinds of text indexing and search, spatial searches in GIS, image processing (QBIC, Photobook). One well-known solution is Illustra's DataBlades for different data types.

Garlic differs in that there is no intention to store everything in one repository; it addresses distribution, heterogeneity and integration of heterogeneous sources.

The concept of interface conformance (interface in the sense of ODMG-93) leads to an interface lattice based on subtyping.

Garlic exploits a specific wrapper technology based on source capability specification. Source capabilities are coded by the programmer within the corresponding wrapper. They remain unknown to the optimizer.

Page 23: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

InfoSleuth: semantic integration of information in open and dynamic environments

Integration of different technological developments in supporting mediated interoperation of data and services over information networks:

• Agent Technology. Specialized agents that represent the users, the information resources, and the system itself cooperate to address the requirements of the users. This decentralizes capabilities, which is the key to system scalability and extensibility.

• Domain models (ontologies). Give a concise, uniform and declarative description of semantic information independent of the underlying models.

• Information Brokerage. Specialized information agents match information needs (specified in terms of some ontology) with currently available sources. So requests can be routed to the relevant sources.

• Internet computing. Java and Java Applets enable deployment of agents at any source of information regardless of its location or platform.

Page 24: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

YAT: XML as a middleware model

YAT offers an XML-oriented algebra with optimization properties, combined with the definition of source query capabilities, wrapping of more structured query languages (e.g., OQL), and a new optimization technique for XML-based integration systems.

Other semistructured/XML systems include TSIMMIS (query templates are used to describe source capabilities) and MIX. However, defining all possible queries according to a schema is not feasible with such templates.

YAT operational model and algebra. XML data (like objects) can be arbitrarily nested, so a technique similar to the object-oriented one is adopted. An operator Bind is applied to an arbitrary XML structure to extract relevant information and produce a Tab structure (comparable to a non-1NF relation). Classical operators such as Join, Select and Project can then be applied to these Tab structures.

Bind operator: takes an input tree and a given filter (a tree with distinct variables) and produces a table that contains the variable bindings resulting from the pattern matching. It is expensive to evaluate, but it can be rewritten into simpler operations.

Tree operator: applied to Tab structures, it returns a collection of trees conforming to some input pattern.
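A much simplified picture of Bind: match a filter tree with variables against a data tree and collect the variable bindings as rows. The tuple-based tree encoding, the `$`-prefixed variable names and the sample data are assumptions of this sketch; YAT's actual operator is considerably richer.

```python
from itertools import product


def bind(tree, flt):
    """Match filter `flt` against `tree`; return a list of binding rows (dicts).

    A node is (label, content); content is a string leaf or a list of nodes.
    In a filter, a string content is a variable name such as "$t".
    """
    label, content = tree
    f_label, f_content = flt
    if label != f_label:
        return []
    if isinstance(f_content, str):       # variable: bind it to the node content
        return [{f_content: content}]
    if not isinstance(content, list):    # filter expects children, tree has none
        return []
    # For each sub-filter, gather the bindings produced by every matching child.
    per_filter = [
        [row for child in content for row in bind(child, sub)]
        for sub in f_content
    ]
    # Combine one binding per sub-filter into a complete row.
    rows = []
    for combination in product(*per_filter):
        row = {}
        for part in combination:
            row.update(part)
        rows.append(row)
    return rows


works = ("works", [
    ("work", [("title", "Primavera"), ("artist", "Botticelli")]),
    ("work", [("title", "Mona Lisa"), ("artist", "Leonardo")]),
])
pattern = ("works", [("work", [("title", "$t"), ("artist", "$a")])])

tab = bind(works, pattern)
# Classical operators now apply to the flat table, e.g. Select then Project:
titles = [row["$t"] for row in tab if row["$a"] == "Botticelli"]
```

Once the nested data has been flattened into `tab`, the selection and projection in the last line are ordinary relational operations.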

Page 25: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Query Planning Methods for Mediators of Heterogeneous Information Sources

Outline:

• Query Planning for LAV approach

• Query Containment Techniques

• Wrapper generation

Page 26: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Representation of Information Sources

Formally, the contents of an information source are described by a pair (or set of pairs) of the form $(v, r_v)$, where $v$ is a class name with $m_v$ state attributes and $r_v$ is a formula of the form:

$r_v = \exists \bar{U}\; p_1(\bar{U}_1) \land \dots \land p_n(\bar{U}_n)$

The formula $r_v$ has $m_v$ distinguished variables. The $p_i$'s are any of the classes on the federated level. The class name $v$ is a new name describing an information source. This means that the source can be asked a query of the form $v(\bar{Z})$ (or any partial instantiation of it), and returns instances with $m_v$ state attributes that satisfy the following implication:

$\forall \bar{Z}\; \bigl(v(\bar{Z}) \Rightarrow r_v(\bar{Z})\bigr)$

Simplified source capability model (input bindings, output, selections):

R1(Y1, ... , Yk):- R(X1, ... , Xm), 1 = a1, ... , n = an, = Y1, ... , k = Yk, 1, ... , h

Page 27: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Sound and Relevant Query Plans

A simplified query $Q$ to the mediator can be represented as a conjunction:

$Q(\bar{Y}) : \exists \bar{X}\; p_1(\bar{X}_1) \land \dots \land p_n(\bar{X}_n)$

$\bar{X}, \bar{X}_1, \dots, \bar{X}_n$ are tuples of variables or constants and the $p_i$'s are any of the classes on the federated level. The answer to the query is the set of bindings that can be obtained for the variables in $\bar{Y}$.

Given a query of the form above, the query processor generates a set of conjunctive plans for answering $Q(\bar{Y})$ as formulae of the form:

$Q(\bar{Y}) : \exists \bar{U}\; v_1(\bar{U}_1) \land \dots \land v_k(\bar{U}_k) \land C_p$

where each of the $v_i$'s is a class name associated with an information source, and $C_p$ is a conjunction of atoms of order relations. Note that the distinguished variables in the plan are the same as the ones in the query. Given a conjunctive plan $P$, the descriptions of the information sources imply that the following constraints hold on the answers it produces (recall that $r_{v_i}$ is the formula describing the constraints on the instances found in $v_i$):

$Con_P : r_{v_1}(\bar{U}_1) \land \dots \land r_{v_k}(\bar{U}_k) \land C_p$

Page 28: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Sound and Relevant Query Plans

Definition: A conjunctive plan $P$ is sound if all the answers it produces are guaranteed to be answers to the query, i.e., if the following entailment holds:

$\forall \bar{Y}\; \bigl(Con_P \Rightarrow \exists \bar{X}\; p_1(\bar{X}_1) \land \dots \land p_n(\bar{X}_n)\bigr)$

Several conjunctive plans to answer a query are required because the information sources are not complete.

Definition: A conjunctive plan $P$ is relevant to a query $Q(\bar{Y}) : \exists \bar{X}\; p_1(\bar{X}_1) \land \dots \land p_n(\bar{X}_n)$ if the sentence $\exists \bar{Y}, \bar{X}\; \bigl(Con_P \land p_1(\bar{X}_1) \land \dots \land p_n(\bar{X}_n)\bigr)$ is satisfiable.
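A small worked instance may make the two definitions concrete. The classes painting, creator and nationality and the sources $v_1$, $v_2$ are hypothetical, chosen only for illustration.

```latex
\begin{align*}
Q(x) &: \exists a\; \mathit{painting}(x) \land \mathit{creator}(a, x) \land \mathit{nationality}(a, \text{Italy}) \\
r_{v_1}(x, a) &= \mathit{painting}(x) \land \mathit{creator}(a, x) \\
r_{v_2}(a) &= \mathit{nationality}(a, \text{Italy}) \\[4pt]
P_1 &: Q(x) : \exists a\; v_1(x, a) \land v_2(a) \\
Con_{P_1} &= \mathit{painting}(x) \land \mathit{creator}(a, x) \land \mathit{nationality}(a, \text{Italy}) \\[4pt]
P_2 &: Q(x) : \exists a\; v_1(x, a) \\
Con_{P_2} &= \mathit{painting}(x) \land \mathit{creator}(a, x)
\end{align*}
```

Here $Con_{P_1}$ entails the body of the query, so $P_1$ is sound. $Con_{P_2}$ together with the body of the query is satisfiable, so $P_2$ is relevant, but $Con_{P_2}$ does not entail $\mathit{nationality}(a, \text{Italy})$, so $P_2$ is not sound: it may return paintings by non-Italian creators.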

Page 29: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Plan Generation

First step: separately for each subgoal in the query, compute which information sources are relevant to it and collect these sources into the subgoal's bucket. An information source is relevant to a subgoal g if the description of the source contains a subgoal g1 that can be unified with g such that, after the unification, the constraints in the query and the constraints in the source description are mutually satisfiable.

‘Satisfiable’ means that the conjunction of built-in atoms should be satisfiable and there are no two subgoals C(x) and D(x) where C and D are disjoint classes. ‘Mutually satisfiable’ means that if C(Q) and C(U) are the conjunctions of constraint subgoals in the query and the source, then C(Q) & C(U) should be satisfiable.

Second step: candidate conjunctive plans are constructed by choosing one relevant source for every subgoal in the query, and each plan is checked for soundness and relevance. Specifically, every conjunctive plan $Q_1$ of the following form is considered:

$Q_1(\bar{Y}) : \exists \bar{U}\; v_1(\bar{U}_1) \land \dots \land v_n(\bar{U}_n)$

where $v_i(\bar{U}_i)$ has been deemed relevant to subgoal $p_i$ in the query. Each such plan is checked to be (1) relevant, (2) sound (if it is not, it is checked whether it can be made sound by adding conjuncts of order predicates), and (3) minimal (i.e., no subgoal can be removed from the plan while still obtaining a sound plan).
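The first step and the enumeration part of the second step can be sketched as follows. The sketch is deliberately simplified and all names in it are hypothetical: atoms are `(predicate, arguments)` pairs, variables start with `?`, unification only checks that constants do not clash, and the checks for relevance, soundness and minimality of each candidate plan are omitted.

```python
from itertools import product


def is_var(term):
    return isinstance(term, str) and term.startswith("?")


def unifiable(a, b):
    """Atoms unify if predicate and arity agree and no two constants clash.

    Simplification: repeated variables are not tracked.
    """
    (p, xs), (q, ys) = a, b
    return p == q and len(xs) == len(ys) and all(
        is_var(x) or is_var(y) or x == y for x, y in zip(xs, ys)
    )


def build_buckets(query, sources):
    """Step 1: one bucket per query subgoal, holding names of relevant sources."""
    return [
        [name for name, body in sources.items()
         if any(unifiable(goal, atom) for atom in body)]
        for goal in query
    ]


def candidate_plans(buckets):
    """Step 2 (enumeration only): choose one source from every bucket."""
    return list(product(*buckets))


# Hypothetical source descriptions in terms of federated classes.
sources = {
    "v1": [("painting", ("?x",)), ("creator", ("?c", "?x"))],
    "v2": [("nationality", ("?c", "Italy"))],
    "v3": [("nationality", ("?c", "France"))],
}
query = [
    ("painting", ("?p",)),
    ("creator", ("?a", "?p")),
    ("nationality", ("?a", "Italy")),
]

buckets = build_buckets(query, sources)
plans = candidate_plans(buckets)
```

Source `v3` never enters the third bucket because its constant `France` clashes with `Italy` in the query, which is a toy version of the mutual satisfiability check.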

Page 30: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Plan Generation

These properties are usually checked using algorithms for containment of conjunctive queries. The algorithm is guaranteed to produce only sound and relevant plans.

Does the algorithm produce all the necessary conjunctive plans? The answer rests on the close relationship between the problem of finding conjunctive plans and the problem of answering queries using materialized views.

Although the cost of checking minimality and soundness of a conjunctive plan is exponential, it is exponential only in the size of the query, which tends to be small, and not in the number of information sources or the size of their contents.

Page 31: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Query Containment Algorithms

• Basic techniques (e.g., QinP (Ullman): Containment of conjunctive queries in logical recursions, negation in conjunctive queries by Chan)

• Extensions: 

1. Containment for queries with complex objects. Typing constraints and integrity constraints for object DB schemas

2. Relative containment

3. Conjunctive queries with regular expressions; query containment under constraints

4. Bag containment of conjunctive queries

• Alternative techniques 

1. Counter machines to study query containment

2. Verification of knowledge bases

3. Description Logics

Page 32: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Containment of Conjunctive Queries in Logical Recursions (QinP)

An algorithm for testing whether a conjunctive query is contained in the relation defined by a logic program.

Given are a conjunctive query $Q$, represented as

$H \mathrel{\text{:-}} G_1 \land \dots \land G_k$

and a logic program $P$.

To decide whether $Q \subseteq P$:

1) Assign a unique constant to every variable in $Q$ (that is, "freeze" the query).

2) Form the EDB relations from the frozen subgoals of $Q$.

3) Evaluate $P$ bottom-up over this EDB.

4) If the frozen head $H$ is among the derived facts, then $Q \subseteq P$.
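The four steps can be tried out with a short sketch. It assumes a plain Datalog setting without negation or built-in predicates; atoms are `(predicate, arguments)` pairs, variables are strings starting with an upper-case letter, and frozen constants are named `c_<variable>`. These encoding choices and all function names are assumptions of the example.

```python
def is_var(term):
    return isinstance(term, str) and term[:1].isupper()


def substitute(atom, env):
    pred, args = atom
    return (pred, tuple(env.get(a, a) for a in args))


def match(atom, fact, env):
    """Extend env so that atom equals fact, or return None if impossible."""
    (p, args), (q, vals) = atom, fact
    if p != q or len(args) != len(vals):
        return None
    env = dict(env)
    for a, v in zip(args, vals):
        if is_var(a):
            if env.setdefault(a, v) != v:
                return None
        elif a != v:
            return None
    return env


def evaluate(program, facts):
    """Naive bottom-up evaluation of rules (head, body) up to a fixpoint."""
    facts = set(facts)
    while True:
        new = set()
        for head, body in program:
            envs = [{}]
            for atom in body:
                extended = []
                for env in envs:
                    for fact in facts:
                        env2 = match(atom, fact, env)
                        if env2 is not None:
                            extended.append(env2)
                envs = extended
            new |= {substitute(head, env) for env in envs}
        if new <= facts:
            return facts
        facts |= new


def contained_in(query, program):
    head, body = query
    variables = {a for _, args in [head] + body for a in args if is_var(a)}
    freeze = {v: f"c_{v}" for v in variables}          # step 1: unique constants
    edb = {substitute(atom, freeze) for atom in body}  # step 2: EDB from subgoals
    derived = evaluate(program, edb)                   # step 3: bottom-up evaluation
    return substitute(head, freeze) in derived         # step 4: frozen head derived?


# P: path(X,Y) :- edge(X,Y).   path(X,Z) :- edge(X,Y), path(Y,Z).
P = [
    (("path", ("X", "Y")), [("edge", ("X", "Y"))]),
    (("path", ("X", "Z")), [("edge", ("X", "Y")), ("path", ("Y", "Z"))]),
]
two_steps = (("path", ("A", "C")), [("edge", ("A", "B")), ("edge", ("B", "C"))])
converging = (("path", ("A", "C")), [("edge", ("A", "B")), ("edge", ("C", "B"))])

two_steps_contained = contained_in(two_steps, P)
converging_contained = contained_in(converging, P)
```

The two-step query is contained in the recursive path program, whereas the query with two edges converging on `B` is not, because no path from `A` to `C` can be derived from its frozen body.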

Page 33: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

A Query Converter for Wrappers Toolkit

In TSIMMIS, the query converter is a part of the wrapper implementation toolkit. It works with MSL, a logic-based, OEM-oriented query language.

Source capabilities are defined with templates in a Query Description and Translation Language (QDTL). Each template can be associated with an action that generates the commands for the underlying source.

The converter will process:

• Directly supported queries. These are queries that syntactically match a template.

• Logically supported queries. These are queries that produce the same results as a directly supported query. The notion of logical equivalence is used to detect queries that fall in this class.

• Indirectly supported queries. These are queries that can be executed in two steps: first a directly supported query is executed, and then a filter is applied to the results of the first step.
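The two-step execution of an indirectly supported query can be sketched as below. This is not QDTL: the capability description is reduced to a set of attributes on which the source accepts equality conditions, and the function names and sample rows are assumptions made for the example.

```python
OPS = {
    "=": lambda a, b: a == b,
    "<": lambda a, b: a < b,
    ">": lambda a, b: a > b,
}


def split_query(conditions, supported_attrs):
    """Split a conjunctive query into a native part and a residual filter.

    A condition is (attribute, operator, value). Assumption of this sketch:
    the source accepts only equality conditions on `supported_attrs`.
    """
    native = [c for c in conditions if c[1] == "=" and c[0] in supported_attrs]
    residual = [c for c in conditions if c not in native]
    return native, residual


def holds(row, conditions):
    return all(OPS[op](row[attr], value) for attr, op, value in conditions)


def run(conditions, supported_attrs, source_rows):
    native, residual = split_query(conditions, supported_attrs)
    # Step 1: the source evaluates only the part it supports (simulated here).
    step1 = [row for row in source_rows if holds(row, native)]
    # Step 2: the filter removes rows that violate the remaining conditions.
    return [row for row in step1 if holds(row, residual)]


rows = [
    {"author": "Ullman", "year": 1989},
    {"author": "Ullman", "year": 1997},
    {"author": "Levy", "year": 1996},
]
wanted = [("author", "=", "Ullman"), ("year", ">", 1990)]

native, residual = split_query(wanted, {"author"})
answer = run(wanted, {"author"}, rows)
```

If `residual` comes back empty, the query is directly supported; otherwise the native part plays the role of the supporting query and the residual conditions form the filter.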

Page 34: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Detection of maximal supporting query and of a filter

A query $q_s$ is a maximal supporting query of query $q$ with respect to capability description $D$ if $q_s$ is directly supported by $D$, $q_s$ indirectly supports $q$, and there is no directly supported query $q'_s$ that indirectly supports $q$, is subsumed by $q_s$, and is not logically equivalent to $q_s$. There may be more than one maximal supporting query for a given query.

Capability description $D$ is expressed as a (possibly recursive) Datalog program. The problem of determining whether a description $D$ supports query $Q$ is the same as the problem of determining whether program $P(D)$ contains (subsumes) query $Q$ and whether a corresponding filter query exists. A supporting query is found in two steps:

1. find a subsuming query, and

2. find the corresponding filter.

The approach is based on X-QinP, an extension of Ullman's query containment algorithm QinP. QinP gives only a yes/no answer to the containment question; X-QinP additionally finds the actual maximal supporting queries and the native query constituents for the underlying source.

Page 35: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Known modifications of query rewriting algorithms using views

 

1. Conjunctive queries

2. Source templates with binding patterns

3. Recursive queries

4. Views in description logic

5. Rewriting for semistructured data. Regular expressions rewriting, navigational plans

6. Boolean queries rewriting

7. Queries with union and aggregation

8. Type inferencing

9. Object fusion

10. Scalable technique

Page 36: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Infrastructure of the mediator aiming at semantic interoperability of collections

Outline:

• Heterogeneity of the mediator

• Canonical information model

• Mediator’s metadata

• Information extraction framework

• Collection registration at a mediator as a process of compositional development

Page 37: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Heterogeneous information models absorbed by the canonical model

[Diagram: the canonical model (core plus extensions) and the information models linked to it by is_refined_by]

• Component Models (IDL, CDL, BOF)

• Object & Heterogeneous DB Models (ODL, SQL3, Garlic)

• Knowledge Base Representations (OKBC, Ontolingua)

• Unstructured Data (vocabularies, thesauri)

• Semistructured Data Models (OEM, ADM, OQL-doc)

• Document Object Model

• Metadata for DL (Dublin Core, Warwick, Starts, Z39.50)

• Metadata Expressible in Meta Models (MOF, RDF)

• Workflow Models

Page 38: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Canonical Model Entities

[Diagram: entities of the canonical model and their relationships]

Entities: Metaclass, Class, Type, Object, Frame, Collection, World, Abstract Value

Relationship labels: instance_of, superclass, supertype, type, instance type, becomes an object

Page 39: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Canonical Information Model

A set of the canonical model facilities used for the uniform representation of the information resources includes the following:

• Frame representation facilities. Frames are treated as a special kind of abstract value, introduced mostly for the description of concepts and of terminological and weakly structured information. All specifications in the canonical model take the form of frames that become part of the metabase.

• Unifying type system. The type system includes a universal constructor of arbitrary abstract data types as well as a comprehensive collection of built-in types.

• Class representation. Classes represent sets of homogeneous entities of an application domain. Class instances (objects) have specific types.

• Multiactivity (workflow) representation. Multiactivities are used for the specification and implementation of interconnected and interdependent application activities, and for the specification of declarative assertions and concurrent megaprograms over the information resources.

• Facilities for expressing logical formulae. A multisorted object calculus (a typed first-order language) is used for querying the integrated set of digital collections as well as for the specification of constraints and behaviour.

Page 40: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Mediator’s Metadata Layering

[Figure: layers of the mediator's metadata.
Levels: Real Collection Level, Local Level, Interoperable (Federated) Level, Personalized DL Level.
Elements shown: structured, semistructured and unstructured collections; schemas, vocabularies, thesauri and ontologies of the individual collections; Federated Schema and Common Thesauri (Core, Extension); Subject Classification Hierarchy & Context (metaclass hierarchy & ontological definitions); Views, Subschemas and Specific Vocabulary.]

Page 41: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Information Extraction Framework

[Figure: architecture of the information extraction framework.
• Local Collections and their wrappers (Localization Facilities): XML data system (XML wrapper, http); Z39.50 server (Z39.50 wrapper, Z39.50); information retrieval system (information retrieval system wrapper, IIOP); molecular biology data banks (SRS wrapper, http)
• Information Extraction Facilities: Mediator’s DBMS (object-relational DBMS) holding the metadata repository and data; Query Engine (canonical mediator’s query language, best relevant collection identification, query decomposition, query planning and monitoring); Outcome Presentation (ranking, merging, aggregation, summarization)
• Personalization Facilities: Personalized DL, Canonical GUI, Graphical Query Facilities
• Interoperation environment: Java / CORBA]

Page 42: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Metainformation Repository

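[Figure: class diagram of the metainformation repository. Classes: Value, Frame, Slot, Attribute, Module, Type, Class, Schema, ADT, CEntity, Function, Concept, Reduct, CompType, View, Category, Metaclass, Simulating. Labelled associations (with 1-to-many multiplicities): slots, instances, instType, instInstType, simulatings.]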

Page 43: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Collection Registration Framework

The framework facilities are intended to support functions of collection contextualizing:

• constructing a mapping of the collection's data model and metadata into the canonical ones;

• representation of the new metainformation in terms of the federated mediator's level;

• inferring from the collection the required information for the federated level;

• semi-automatic construction of a collection wrapper;

• connecting the wrapper to the interoperation environment (e.g., CORBA).

Page 44: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Contextualization of Ontology

• mapping of local ontological context to that of the mediator

  – by names and relationships

  – by natural language description

  – applying structural integration to concept specifications

  – introducing new concepts over existing ones

• contextualization through structural correlation

  – establishing weak ontological relevance of specification elements, applying analysis of intercontext concept relationships

  – establishing tight ontological relevance of specification elements, introducing a subsumption relationship between concepts

Page 45: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Correlation of Ontological Concepts

• evaluation of descriptor weights

• establishing intercontext relationships between concepts

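$$W_{Xk} = \frac{f_k \log (N / n_k)}{\sqrt{\sum_{i \in V_X} \left( f_i \log (N / n_i) \right)^2}}$$

$$sim(X, Y) = \frac{\sum_{k \in V_X \cap V_Y}^{t} W_{Xk} W_{Yk}}{\sqrt{\sum_{k \in V_X}^{t} W_{Xk}^2 \cdot \sum_{k \in V_Y}^{t} W_{Yk}^2}}$$

$$r(X, Y) = \frac{\sum_{k \in V_X \cap V_Y}^{t} \min (W_{Xk}, W_{Yk})}{\sum_{k \in V_X}^{t} W_{Xk}^2}$$

$$r(Y, X) = \frac{\sum_{k \in V_X \cap V_Y}^{t} \min (W_{Xk}, W_{Yk})}{\sum_{k \in V_Y}^{t} W_{Yk}^2}$$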
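The descriptor weighting and the concept similarity on this slide can be sketched in a few lines. The code below assumes a tf-idf reading of the weight formula (f_k as the frequency of descriptor k in a concept's description, n_k as the number of concepts whose descriptions contain k, N as the total number of concepts) and a cosine reading of sim(X, Y); these interpretations, the function names and the sample numbers are assumptions, not definitions taken from the slides.

```python
import math

def descriptor_weights(freqs, concept_counts, n_total):
    """Normalised weights W_Xk for one concept X.

    freqs: {descriptor: f_k}, concept_counts: {descriptor: n_k},
    n_total: N. The meanings of f_k, n_k and N are assumed (tf-idf style).
    """
    raw = {k: f * math.log(n_total / concept_counts[k]) for k, f in freqs.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {k: (v / norm if norm else 0.0) for k, v in raw.items()}

def sim(wx, wy):
    """Cosine similarity of two concepts given their descriptor weights."""
    shared = wx.keys() & wy.keys()
    numerator = sum(wx[k] * wy[k] for k in shared)
    denominator = math.sqrt(sum(v * v for v in wx.values())
                            * sum(v * v for v in wy.values()))
    return numerator / denominator if denominator else 0.0

# Hypothetical descriptor statistics for three concepts out of N = 10.
w_painting = descriptor_weights({"canvas": 2, "museum": 1}, {"canvas": 2, "museum": 5}, 10)
w_picture = descriptor_weights({"canvas": 1, "frame": 1}, {"canvas": 2, "frame": 4}, 10)
w_statue = descriptor_weights({"marble": 1}, {"marble": 4}, 10)
```

Concepts that share descriptors (here `w_painting` and `w_picture`) get a similarity strictly between 0 and 1, while concepts with disjoint descriptor sets get 0.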

Page 46: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Ontological Metainformation

[Figure: UML class diagram of the ontological metainformation.
Classes: Class; ADT; Category (code: string); Concept (definition: string, wordClass: string); Descriptor (weight: float, name: string); ConceptRel (strength: float = 1); ConceptWeight (weight: float, frequency: float, name: string); and the relationship classes PositiveRel, NarrowRel, PartRel and RelativeRel.
Labelled associations: type, descriptors / descriptorOf, fromRelation / toConcept, toRelation / fromConcept, weights / weightOf, foreign, collection, concept, category.]

Page 47: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Process of an Information Source Registration

For each source class the following steps (of the compositional development process) are required [LNCS 2151]:

1. relevant federated classes identification

• Find the federated classes that are ontologically suitable for defining the extent of the source class. Several federated classes may correspond to one source class, their instance types covering different reducts of the source class instance type. Conversely, several source classes may correspond to one federated class.

2. most common reducts construction

For an instance type of each identified federated class do:

• Construct the most common reducts of the instance type of this federated class and the source class instance type, so as to concretize (partially) the federated instance type. A most common reduct may also include additional attributes corresponding to those federated type attributes that can be derived from the source type instances.

• In this process, a concretizing type, a concretizing function or a combination of the two should be constructed for each attribute type of the common reduct (this step is applied recursively).

Page 48: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Process of an Information Source Registration

For each source class the following steps are required:

3. partial source view construction

• For each relevant federated class, construct a partial source view expressing a constraint, in terms of the federated class, that must be satisfied by the values of the respective most common reducts of source class instances. This yields partial views over all relevant federated classes.

4. partial views composition

• Construct compositions of the source type most common reducts obtained for instance types of all federated classes involved.

• Construct a source view as a composition of partial views obtained above. This is an expression of a materialized view of an information source in terms of federated classes. An instance type of this view is determined by the most common reducts composition constructed above.

Page 49: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Subject Mediator. Cultural Heritage Collections. Collections Registration

Federated Level Metainformation

Local into Federated Level Mapping

[Figure: source classes mapped into the federated level.
• CIMI Profile of Z39.50: museum_object (created_by*, date_collected*, description*, object_id*, relation*, …, content_general, collection, mrObject); creator_c (nationality, works)
• Louvre Museum Web Site: department (name, description, sections); author (name, nationality, works)
• Uffizi Museum Web Site: artist (name, biography, paint_list); canvas (title, painter, date, history, description, to_image)]

creator_c(c/Creator_Creator_Info[name, nationality, date_of_birth, date_of_death, works/{set_of: Heritage_Entity_Museum_Object}]) ⊆ creator(c[name, nationality, date_of_birth, date_of_death, works])

author(a/Creator_Author[name/fname, nationality, works/{set_of: Heritage_Entity_Work}]) ⊆ creator(name, nationality, works(w)) & ∃ c,s (repository(c/Collection[contains(s/Section)]) & repository.name = 'Louvre' & in(w, s.contains))

artist(a/Creator_Artist[name, nationality, general_info/Text_Textual, works/{set_of: Painting_Canvas}]) ⊆ creator(a[name, nationality, general_info, works]) & repository(n/name, collection) & n = 'Uffizi' & ∃ col/Collection (isempty(intersect(collection(col/Collection).contains, works)))

Local Views in Terms of Federated Classes

Page 50: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Specifications of Types of the Uffizi Site Schema

[Figure: UML class diagram of the Uffizi site schema.
Types: Repository (name: string); Artist (name: string, biography: Textual); Canvas (title: Textual, painter: string, culture: Textual, date: time, description: Textual); Room (room_no: string, room_name: Textual); Image.
Labelled associations (the collection-valued ones marked {ordered}): authors, contains, paint_list, room_list, to_image.]

Page 51: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Specifications of Types of the Federated Schema

[Figure: UML class diagram of the federated schema.
Types: Person (name: string, nationality: string, date_of_birth: time, date_of_death: time, residence: Address); Creator (culture_race: string, general_info: Text); Entity (title: Text, date: time, narrative: Text); Heritage_Entity (place_of_origin: Address, date_of_origin: time, content: Text); Painting (dimensions: {sequence; type_of_element: integer}); Antiquities (type_spicemen: Text, archeology: Text); Repository (name: string, place: Address, description: Text); Collection (name: Text, location: Address, description: Text); Digital_Entity.
Labelled associations: created_by, works, collections, in_repository, in_collection, contains, digital_form.]

Page 52: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Most Common Reduct (Example)

{CR_Painting_Canvas;
  in: c_reduct;
  metaslot
    of: Canvas;
    taking: {title, painter, date, description, to_image};
    reduct: R_Painting_Canvas
  end;
  simulating: {
    R_Painting_Canvas.title ~ CR_Painting_Canvas.title;
    R_Painting_Canvas.created_by ~ CR_Painting_Canvas.get_created_by;
    R_Painting_Canvas.date_of_origin ~ CR_Painting_Canvas.date;
    ... };
  get_created_by: {in: function;
    params: {+ext/CR_Painting_Canvas, -returns/Creator};
    predicative: {ex c/Canvas ((c/CR_Painting_Canvas = ext) &
      ex a/Artist ((c.painter = a.name) &
        returns = a/CR_Creator_Artist))}}
  ...
}
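One way to read the simulating section of this example is as a table that says how each attribute of the federated reduct is obtained from a source instance, either by copying a source attribute or by calling a concretizing function. The sketch below illustrates that reading only; the dictionary encoding, the function names and the sample values are hypothetical, and the real construction also has to reconcile attribute types, which is omitted here.

```python
def most_common_reduct(federated_attrs, source_attrs, correspondences):
    """Keep the federated attributes that the source type can support.

    correspondences maps a federated attribute to a pair
    (source attributes needed, function computing the value from a source instance).
    Returns (federated reduct attributes, source attributes taken, simulating map).
    """
    simulating = {}
    taken = set()
    for fed_attr in federated_attrs:
        rule = correspondences.get(fed_attr)
        if rule is None:
            continue  # no known way to derive this attribute from the source
        needed, fn = rule
        if set(needed) <= set(source_attrs):
            simulating[fed_attr] = fn
            taken.update(needed)
    return sorted(simulating), sorted(taken), simulating

def concretize(source_instance, simulating):
    """Build the federated-reduct view of one source instance."""
    return {attr: fn(source_instance) for attr, fn in simulating.items()}

# Attribute names follow the Painting / Canvas example; the rules are illustrative.
painting_attrs = ["title", "created_by", "date_of_origin", "dimensions"]
canvas_attrs = ["title", "painter", "date", "description", "to_image"]
rules = {
    "title": (("title",), lambda c: c["title"]),
    "created_by": (("painter",), lambda c: c["painter"]),  # stands in for get_created_by
    "date_of_origin": (("date",), lambda c: c["date"]),
}
fed_reduct, src_taken, simulating = most_common_reduct(painting_attrs, canvas_attrs, rules)
sample = {"title": "Sample title", "painter": "Sample painter", "date": "1500",
          "description": "", "to_image": None}
view = concretize(sample, simulating)
```

Here `dimensions` drops out of the reduct because nothing in the source type supports it, which is the sense in which the reduct concretizes the federated type only partially.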

Page 53: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Partial Source View Construction (Example)

The formula expressing the local class canvas in terms of the federated class painting is defined as:

canvas(p/CR_Painting_Canvas) ⊆ painting(p/R_Painting_Canvas) & p.in_collection.in_repository = 'Uffizi'

The specification of a class containing this formula (in effect, a local-as-view class) is:

{v_canvas_painting;
  in: class;
  class_section: {
    key: invariant, {unique; {title}};
    lav: invariant, {subseteq(v_canvas_painting(p),
      painting(p/R_Painting_Canvas) & p.in_collection.in_repository = 'Uffizi')}
  };
  instance_section: CR_Painting_Canvas
}
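The two invariants of this view class can be made concrete with a small check over sample extents. The sketch below assumes that instances are plain dictionaries, that the path p.in_collection.in_repository is flattened into a single "in_repository" key, and that instances are matched by title; all of these are simplifications introduced for illustration.

```python
def satisfies_view_invariants(view_extent, painting_extent, repository="Uffizi"):
    """Check the key and lav invariants of a local-as-view class.

    key: titles are unique within the view extent.
    lav: every view instance is among the federated paintings of the repository.
    """
    titles = [row["title"] for row in view_extent]
    if len(titles) != len(set(titles)):
        return False  # key invariant violated
    allowed = {row["title"] for row in painting_extent
               if row["in_repository"] == repository}
    return all(title in allowed for title in titles)

# Hypothetical extents.
paintings = [
    {"title": "A", "in_repository": "Uffizi"},
    {"title": "B", "in_repository": "Uffizi"},
    {"title": "C", "in_repository": "Louvre"},
]
good_view = [{"title": "A"}]
bad_view = [{"title": "A"}, {"title": "C"}]
```

Because the invariant is a subset constraint rather than an equality, `good_view` is acceptable even though it omits painting "B": the source is described as sound but not necessarily complete.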

Page 54: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Source View Composition (Example)

The final formula for the local class canvas in terms of the federated classes painting and creator is:

canvas(p/CR_Painting_Creator_Canvas) ⊆
  painting(p/R_Painting_Canvas) & p.in_collection.in_repository = 'Uffizi' &
  creator(c/R_Creator_Canvas) & ∃ w/Painting (in(w, c.works) &
  w.in_collection.in_repository = 'Uffizi')

The complete definition of the source view is:

{v_canvas;
  in: class;
  class_section: {
    key: invariant, {unique; {title}};
    lav: invariant, {subseteq(v_canvas,
      painting(p/R_Painting_Canvas) &
      p.in_collection.in_repository = 'Uffizi' &
      creator(c/R_Creator_Canvas) & ex w/Painting
      (in(w, c.works) & w.in_collection.in_repository = 'Uffizi'))}
  };
  instance_section: CR_Painting_Creator_Canvas
}

CR_Painting_Creator_Canvas = CR_Painting_Canvas ⊔ CR_Creator_Canvas
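The composition of the two most common reducts in the last line can be pictured as a join of their attribute sets. The sketch below assumes that reading: the joined type carries every attribute of both operands, and attributes that occur in both must agree on their type. The attribute lists are illustrative (those of CR_Creator_Canvas are not given on the slides and are hypothetical here).

```python
def join_types(t1, t2):
    """Join two types given as {attribute name: type name} maps.

    Shared attributes must have the same type; otherwise the join is rejected.
    """
    joined = dict(t1)
    for name, type_name in t2.items():
        if name in joined and joined[name] != type_name:
            raise ValueError(f"conflicting types for attribute {name!r}")
        joined[name] = type_name
    return joined

# Illustrative attribute sets for the two reducts.
cr_painting_canvas = {"title": "Textual", "painter": "string", "date": "time"}
cr_creator_canvas = {"painter": "string"}  # hypothetical
cr_painting_creator_canvas = join_types(cr_painting_canvas, cr_creator_canvas)
```

Rejecting conflicting attribute types is a design choice of this sketch; a fuller treatment would try to reconcile the two attribute types instead of failing.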

Page 55: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Structure of the Collection Registration Tool

[Figure: structure of the Collection Registration Tool.
Components: source collection context / mediator metadata reconciliation; most common reduct identification; construction of source class specifications as views over federated classes; wrapper generation (producing wrapper code).
Supporting systems: Mediator’s DBMS (Oracle 8i) holding the metainformation repository; B-Toolkit, working with B-AMN specifications.]

Page 56: Subject Mediation for Integrated Access to Heterogeneous Information  Sources

Summary

• Subject domain mediation has good prospects for integrating heterogeneous information sources as professional communities form around the Internet

• The ‘local as view’ approach looks promising for worlds of multiple, dynamically changing sources (content, availability) and also supports the mediator’s scalability

• Widely known mediator projects and related research have contributed a great deal to mediator definition, query planning, source capability description and wrapper generation

• Many serious gaps remain, e.g.: mostly relational models have been studied and conjunctive queries supported; thesauri and ontologies have not been sufficiently involved; query containment has been studied for precise queries (querying of textual, multimedia, object and semistructured data may require reconsideration); the problem of source view registration for the LAV approach has not been studied; mediator composition problems have not been investigated

• Therefore, the area looks fruitful for research, experimentation and development.