Subject Mediation for Integrated Access to Heterogeneous Information Sources
ADBIS’2001
L. A. Kalinichenko
Institute of Informatics Problems
Russian Academy of Science
Laboratory for compositional information systems development
Various forms of compositions are studied, e.g. :
• Interoperable compositions of pre-existing components for IS design;
• Compositions of heterogeneous information collections;
• Workflow compositions;
• Type compositions in database operations over object collections;
• Heterogeneous mediators compositions.
Web site of the group: http://www.ipi.ac.ru/synthesis/
Talk outline
• Subject Domain Mediation
• Mediators’ Projects: Brief Overview
• Query Planning Methods
• Infrastructure of the mediator aiming at
semantic interoperability of collections
• Summary
Subject Domain Mediation
Outline :
• Objectives of information integration
• The mediator’s concept
• Mediator’s classes
• Consolidation of a mediator
• Advantages of the subject domain mediation approach
Web Search Engines
1 billion Web pages.
Search engines remain the main mechanism for accessing pages. Queries are keyword-based. There are dozens of general-purpose search engines and thousands of specialized ones (regional, thematic, corporate).
The following kinds of general purpose Web search engines can be distinguished:
• basic engines: AltaVista, HotBot, Infoseek, Lycos, WebCrawler, Yahoo, Rambler, Яndex, etc.
• portals: Skworm, Proteus, Instantseek, etc.
• metasearch engines: SavvySearch, Inference Find, ProFusion, etc.
• metasearch utilities: Copernic, BeeLine, SearchPad, etc.
“Metasearch” engines send a request to several search engines and compose a combined response, on the assumption that such a response is more likely to contain relevant information.
Search precision is very low because terms are used for indexing and search in an uncontrolled way. This is the unavoidable price of the simplicity with which home pages are “registered” for the whole Web.
What is the required level of information integration/dissemination
• Just putting information on the Web (creating a homepage, a Web site)
• Inserting a description of a resource into a suitable Digital Library (e.g., into NCSTRL, the Networked Computer Science Technical Report Library, a collection of institutional and archival CS research reports and papers)
• Using subject gateways for easier access to networked information resources in a defined subject area. Subject gateways work as intermediaries.
• Applying a community-oriented digital library (a collection of documents built by a community of users which aims at observing or studying a phenomenon (e.g., in a context of a certain area)).
• Using heterogeneous multidatabase systems.
• Applying subject mediators to support representation and access to various subject domains. Mediators should provide modelling facilities and methods for conversion of unorganized, nonsystematic population of collections registered by different collection providers into a well-structured set of sources supported by the integrated uniform specifications. Metainformation. Systematic registration of collections.
What and for what is to be integrated
What kind of information is to be supported: structured, object, semi-structured, textual, multimedia
What kind of metainformation is needed: thesauri, classifiers, vocabularies, ontologies, schema definitions (data, objects, functions, workflows)
What to disseminate:
1. A document (paper) as a whole using additional document description
2. A document in XML
3. Content of a document
What to retrieve:
1. To discover individual resources (Web pages, documents, papers)
2. To retrieve information relevant to a specific query contained in a collection of resources
3. To retrieve information as workflows, methods and/or data and use various compositions of those
4. To provide for interoperability of the information sources in the process of problem solving:
- technical level of interoperability
- semantic interoperability
Digital repositories of knowledge
Digital repositories of knowledge can be implemented for particular areas, such as Digital Earth, Digital Sky, Digital Bio, Digital Law, Digital Art, Digital Music. Well-known examples include Microsoft TerraServer and the Multi-Terabyte Astronomy Archives.
An example: the objective of DigiTerra (an Environmental Digital Library, Rutgers) is to provide continuous land monitoring, fire detection, water and air quality testing and urban planning, as well as to support research and instructional activities in related areas of science. The vast array of environmental data collected in DigiTerra should include images from a variety of satellites, ground data from continuously monitoring weather stations, and maps, reports and data sets from federal, state and local government agencies, and should serve a diverse user community.
The Mediator Concept
The mediator architecture (Wiederhold, 1992) deals with the problem of integration of heterogeneous information. The sources are "heterogeneous" on many levels:
• data model and types of data used;
• the underlying data units (salaries could be stored on a per-hour or per-month basis);
• behavior of objects involved;
• the underlying concepts. A payroll database may not regard a retiree as an employee, while the benefits department does. Conversely, the payroll department may include consultants in the concept "employee" while the benefits department does not;
• the schema to which the information conforms may not be fixed in advance. Examples of such "semi-structured" information include that found in XML documents, repositories used in the Human Genome Project, and Lotus NOTES.
A mediator provides a uniform query interface to multiple data sources, thereby freeing the user from having to locate the relevant sources, query each one in isolation, and manually combine the information from the different sources.
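As an illustration of the uniform query interface, the following minimal Python sketch shows two sources that store salaries in different units being queried through one mediator-style function. All names here (Wrapper, query, the field names and the 160-hour month) are assumptions invented for the example and do not come from any of the systems discussed in the talk.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List

HOURS_PER_MONTH = 160  # assumption used only to convert hourly rates


@dataclass
class Wrapper:
    """Adapts one source to a common record shape (hypothetical design)."""
    name: str
    fetch: Callable[[], Iterable[dict]]   # returns raw source records
    to_common: Callable[[dict], dict]     # converts field names and units


# The payroll source stores an hourly rate, the benefits source a monthly salary.
payroll = Wrapper(
    name="payroll",
    fetch=lambda: [{"emp": "Ann", "hourly_rate": 25.0}],
    to_common=lambda r: {
        "name": r["emp"],
        "monthly_salary": r["hourly_rate"] * HOURS_PER_MONTH,
    },
)
benefits = Wrapper(
    name="benefits",
    fetch=lambda: [{"person": "Bob", "salary_per_month": 3000.0}],
    to_common=lambda r: {
        "name": r["person"],
        "monthly_salary": r["salary_per_month"],
    },
)


def query(wrappers: List[Wrapper], predicate: Callable[[Dict], bool]) -> List[Dict]:
    """Ask every source, normalise the records, and combine the answers."""
    results = []
    for wrapper in wrappers:
        for raw in wrapper.fetch():
            record = wrapper.to_common(raw)
            if predicate(record):
                results.append({**record, "source": wrapper.name})
    return results


# The user states one condition in the common vocabulary.
well_paid = query([payroll, benefits], lambda r: r["monthly_salary"] >= 3500)
print(well_paid)
```

The user never sees which source holds which unit; the conversion lives entirely in the wrappers.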
Mediation approaches
• integration of information from pre-selected sources according to predefined information needs. A procedural approach (TSIMMIS, Squirrel, WHIPS) integrates information from sources through ad-hoc procedures. When the information needs or the sources change, a new mediator has to be generated. This is known as the Global as View (GAV) approach.
• integration of information from arbitrary sources according to predefined information needs. A declarative approach (Carnot, SIMS, Information Manifold, Infomaster) is used: mediators contain mechanisms to rewrite queries according to source descriptions. A rewritten query should be contained in the original query. This is known as the Local as View (LAV) approach.
• combined LAV and GAV approaches (GLAV)
Mediator Definition as Subject Metainformation Consolidation
To make the mediator scalable, two separate phases of its functioning are distinguished: the consolidation phase and the operational phase.
During the consolidation phase the efforts of the scientific community are focused on defining the mediator's subject by declaring its metainformation. It is assumed that top-level researchers are involved in this process. The metainformation defined during the consolidation phase is assumed to remain stable for a certain period of time, during which it can only be extended. Well-known, representative collections of information in the subject domain are used in the process of defining the metainformation. The metainformation created during the consolidation phase constitutes the federated level of the mediator.
During the operational phase arbitrary information collections, expressed in terms of the federated level, can be registered at the mediator. The registration process is autonomous: collection providers can register independently of each other. Users of the mediator know only the metainformation defining the mediator’s subject and formulate their queries in terms of that subject. For each query the mediator decides which registered collections are relevant.
Subject Mediator. Cultural Heritage Collections. Federated Level Metainformation
[Diagram: federated level classes]
• Person: Creator, Collector, Owner
• Heritage_Entity: Painting, Sculpture, Antiquities
• Repository: Museum, Gallery, Exhibition
• Attributes shown: created_by*, date*, narrative*, identifier*, relation*, …, place_of_origin, history_period, content, origin_history, in_collection, owned_by, digital_form, ...
• «type» Text with operations: contains, near, within, follows, …
• Thesauri: Cultural Heritage, History, Jurisdiction
Subject Mediator. Cultural Heritage Collections. Collections Registration
Federated Level Metainformation
Local into Federated Level Mapping
Sources: CIMI Profile of Z39.50; Louvre Museum Web Site; Uffizi Museum Web Site
[Diagram: local classes and their attributes]
• museum_object: created_by*, date_collected*, description*, object_id*, relation*, …, content_general, collection, mrObject
• creator_c: nationality, works
• department: name, description, sections
• author: name, nationality, works
• artist: name, biography, paint_list
• canvas: title, painter, date, history, description, to_image
creator_c(c/Creator_Creator_Info[name, nationality, date_of_birth, date_of_death, works/{set_of: Heritage_Entity_Museum_Object}]) ⊆ creator(c[name, nationality, date_of_birth, date_of_death, works])
author(a/Creator_Author[name/fname, nationality, works/{set_of: Heritage_Entity_Work}]) ⊆ creator(name, nationality, works(w)) & ∃ c,s (repository(c/Collection[contains(s/Section)]) & repository.name = ‘Louvre’ & in(w, s.contains))
artist(a/Creator_Artist[name, nationality, general_info/Text_Textual, works/{set_of: Painting_Canvas}]) ⊆ creator(a[name, nationality, general_info, works]) & repository(n/name, collection) & n = ‘Uffizi’ & col/Collection (isempty(intersect(collection(col/Collection).contains, works)))
Local Views in Terms of Federated Classes
Mediator
Subject Mediator. Cultural Heritage Collections. Query Planning
Find digital images of Italian paintings of Renaissance containing a drawing of Madonna with a child
User query in terms of the federated level:
{i/Image | ∃ p/Painting, d/Digital_Entity, re/Rendition (creator(nationality, works(p.digital_form(d).rendition(re).resource(i/Image))) & nationality = ‘Italy’ & p.content.contains(‘Madonna with a child’) & p.history_period = ‘Renaissance’)}
The Query Planner consults the Thesaurus: the thesaurus extension may add ‘Virgin Mary’, ‘Mother of God’.
Rewritten query for CIMI:
{i/Image | ∃ o/Heritage_Entity_Museum_Object, d/Digital_Object, re/Rendition (creator_c(nationality, works(o)) & nationality = ‘Italy’ & o.history_period = ‘Renaissance’ & o.content.contains(‘Madonna with a child OR …’) & in(i, o.digital_object(d).rendition(re).resource))}
Rewritten query for Louvre:
{i/Image | ∃ w/Heritage_Entity_Work (author(nationality, works(w)) & nationality = ‘Italy’ & w.history_period = ‘Renaissance’ & w.description.contains(‘Madonna with a child OR …’) & in(i, w.to_image))}
Rewritten query for Uffizi:
{i/Image | ∃ r/Collection_Room, p/Painting_Canvas (artist(nationality, paint_list(p), room_list(r)) & in(p, r.paint_list) & nationality = ‘Italy’ & p.history_period = ‘Renaissance’ & p.description.contains(‘Madonna with a child OR …’) & in(i, p.to_image))}
User
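The thesaurus extension step of query planning can be sketched as a simple term expansion. This is only an illustration: the thesaurus contents and the function names below are hypothetical, and a real planner would also translate the expanded condition into each collection's own attributes.

```python
# Hypothetical thesaurus: a term mapped to the alternatives it may be extended with.
THESAURUS = {
    "Madonna with a child": ["Virgin Mary", "Mother of God"],
}


def expand_term(term, thesaurus):
    """Return the original term followed by its thesaurus alternatives."""
    alternatives = [term]
    for synonym in thesaurus.get(term, []):
        if synonym not in alternatives:
            alternatives.append(synonym)
    return alternatives


def to_or_expression(alternatives):
    """Join alternatives into the OR-expression passed to contains(...)."""
    return " OR ".join(alternatives)


expanded = to_or_expression(expand_term("Madonna with a child", THESAURUS))
print(expanded)
```

A term that is absent from the thesaurus is passed through unchanged, so the expansion never narrows the original query.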
Advantages of subject domain mediation
1. Subject mediation makes it possible to achieve semantic integration of heterogeneous information collections.
2. Users need to know only the subject definitions, which contain concepts, structures and methods as defined by the community.
3. Information providers can disseminate their information for integration independently of each other and at any time. To do so, they register their information at the subject mediator. Users need not know anything about the registration activity.
4. The contexts of autonomous information collections, the data models and languages they use, and their implementation platforms are completely independent of the mediator and its consolidated metainformation definitions.
5. By querying the subject definitions, users obtain integrated access to all information registered at the mediators up to the moment of the query.
6. Mediators form a recursive structure: each mediator can be registered at another mediator. Thus multiple subjects can be semantically integrated by defining higher-level mediators.
7. Personalization, providing convenient views for specific groups of users, can be built above the subject definitions. This process is independent of the existing collections and their registration.
Disadvantages of subject mediation
1. Providing a subject definition requires that the scientific community has reached a proper level of maturity and organization (e.g., are the research and development groups in the area sufficiently open, collaborative and motivated?). Subject consolidation is a collective, organized effort of the community.
2. The registration process is not easy and requires specific supporting tools.
Mediator’s Recursion
[Diagram: interactions of a Mediator]
• Query mediator / Data from mediator
• Query collection / Data from collection
• Register collection
• Register mediator (as collection)
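The recursive structure, in which a mediator can itself be registered at another mediator as if it were a collection, can be sketched as follows. The class and method names are hypothetical; the only point illustrated is that collections and mediators share the same query interface.

```python
class Collection:
    """A registered information collection (illustrative, in-memory)."""

    def __init__(self, name, records):
        self.name = name
        self.records = list(records)

    def query(self, predicate):
        return [r for r in self.records if predicate(r)]


class Mediator:
    """Answers queries by delegating to whatever has been registered."""

    def __init__(self, name):
        self.name = name
        self.registered = []

    def register(self, source):
        # A source is anything with a query(predicate) method:
        # a Collection or another Mediator.
        self.registered.append(source)

    def query(self, predicate):
        answers = []
        for source in self.registered:
            answers.extend(source.query(predicate))
        return answers


uffizi = Collection("uffizi", [{"title": "Primavera", "kind": "painting"}])
louvre = Collection("louvre", [
    {"title": "Mona Lisa", "kind": "painting"},
    {"title": "Venus de Milo", "kind": "sculpture"},
])

paintings = Mediator("paintings")
paintings.register(uffizi)

heritage = Mediator("cultural_heritage")
heritage.register(paintings)   # a mediator registered as a collection
heritage.register(louvre)

found = heritage.query(lambda r: r["kind"] == "painting")
print([r["title"] for r in found])
```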
Mediators’ Projects: Brief Overview
Outline :
• TSIMMIS (Stanford)
• Information Manifold (Univ. of Washington)
• GARLIC (IBM)
• InfoSleuth (MCC)
• XML as a middleware model
TSIMMIS (The Stanford-IBM Manager of Multiple Information Sources)
In TSIMMIS mediators are built above a GIVEN set of sources with wrappers that export OEM self-describing objects.
OEM (Object Exchange Model) is used as a unifying data model. The mediators considered provide integrated OEM views of the underlying information (e.g., if a relational source is considered, it is exported as a set of OEM objects.)
Mediators are specified with MSL (Mediator Specification Language), a logic-based object-oriented language targeted to OEM that can be seen as a view definition language. Variables in MSL may refer only to existing sets. In the absence of negation, MSL can be viewed as a variant of Datalog. A query consists of rules that use <object-id label value> as patterns. To describe a mediator in MSL, one gives logical rules that define the OEM objects that the mediator makes available in a view.
Wrappers are specified with WSL, an extension of MSL that allows the description of source contents and querying capabilities.
Information Manifold
In the Information Manifold a reasoning phase is required for realizing which sources have the data of interest, unlike TSIMMIS where view expansion is all that is needed for finding what data each source must contribute.
The user interacts with a uniform interface in the form of a set of global relations (the mediated schema) used in formulating queries. The actual data is stored in external source relations. To answer queries, a mapping between the relations in the mediated schema and the source relations must be specified. A method to specify these mappings is to describe each source relation as the result of a conjunctive query (i.e., a single Horn rule) over the relations in the mediated schema.
Given a user query formulated in terms of the relations in the mediated schema, the system must translate it to a query that mentions only the source relations and is a maximally contained plan. The collection of available data sources may not contain all the information needed to answer a query.
The Information Manifold provides uniform access to structured information sources on the WWW.
Source Query Capabilities Representations in Mediation Frameworks
Sources express their capabilities in mediation systems through a variety of mechanisms - query templates, capability records, and simple capability-description grammars.
As regards query capabilities, data sources with different and limited capabilities are accessed either by writing rich functional wrappers for the more primitive sources or by dealing with all sources at a "lowest common denominator" level. In another approach, the mediator ensures that sources receive only queries they can handle, while still taking advantage of all the query power of each source.
Wrappers reflect the actual query capabilities of the underlying data sources, while the mediator has a general mechanism for interpreting those capabilities and forming execution strategies for queries. Capabilities-Based Rewriters (CBR) are basic mechanisms of the mediators to develop a plan for a query taking into account capabilities of the sources.
The GARLIC Approach (IBM Almaden)
The main targets are heterogeneous and multimedia information systems.
Multimedia systems usually support only specific data types, for example document retrieval through various kinds of text indexing and search, spatial searches in GIS, and image processing (QBIC, Photobook). One well-known solution is Illustra's DataBlades for different data types.
Garlic differs in that there is no intention to store everything in one repository: it addresses distribution, heterogeneity and the integration of heterogeneous sources.
The concept of interface conformance (interfaces in the sense of ODMG-93) leads to an interface lattice based on subtyping.
Garlic exploits specific wrapper technology based on source capability specification. Source capabilities are coded by the programmer within the corresponding wrapper. They remain unknown to the optimizer.
InfoSleuth: semantic integration of information in open and dynamic environments
Integration of different technological developments in supporting mediated interoperation of data and services over information networks:
• Agent Technology. Specialized agents that represent the users, the information resources, and the system itself cooperate to address the system requirements of the users. The resulting decentralization of capabilities is the key to system scalability and extensibility.
• Domain models (ontologies). Give a concise, uniform and declarative description of semantic information independent of the underlying models.
• Information Brokerage. Specialized information agents match information needs (specified in terms of some ontology) with currently available sources. So requests can be routed to the relevant sources.
• Internet computing. Java and Java Applets enable deployment of agents at any source of information regardless of its location or platform.
YAT: XML as a middleware model
YAT offers an XML-oriented algebra with optimization properties, combined with a definition of source query capabilities, the wrapping of more structured query languages (e.g., OQL), and a new optimization technique for XML-based integration systems.
Other semistructured/XML systems include TSIMMIS (where query templates are used to describe source capabilities) and MIX. However, defining all possible queries over a schema is not feasible with such templates.
YAT operational model and algebra. XML data (like objects) can be arbitrarily nested, so a technique similar to the object-oriented one is adopted. To an arbitrary XML structure the Bind operator is applied, whose function is to extract the relevant information and produce a Tab structure (comparable to a non-1NF relation). Classical operators such as Join, Select and Project can then be applied to these Tab structures.
Bind operator: takes an input tree and a filter (a tree with distinct variables) and produces a table containing the variable bindings that result from pattern matching. It is expensive to evaluate, but it can be rewritten into simpler operations.
Tree operator: applied to Tab structures; returns a collection of trees conforming to some input pattern.
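The idea behind Bind, extracting variable bindings from a nested XML structure into a flat table to which classical operators apply, can be imitated in a few lines of Python. This is a loose sketch and not YAT's actual algebra: the filter is simplified to an element tag plus a mapping from variable names to child tags, and bind, select and project are names chosen for the example.

```python
import xml.etree.ElementTree as ET


def bind(root, element_tag, variables):
    """Match every <element_tag> node and bind each variable to the text of
    the child with the given tag. Nodes lacking a required child are skipped."""
    table = []
    for node in root.iter(element_tag):
        row = {}
        for var, child_tag in variables.items():
            child = node.find(child_tag)
            if child is None:
                break                      # pattern does not match this node
            row[var] = (child.text or "").strip()
        else:
            table.append(row)              # all variables were bound
    return table


def select(table, predicate):
    """Classical selection over the table of bindings."""
    return [row for row in table if predicate(row)]


def project(table, columns):
    """Classical projection over the table of bindings."""
    return [{c: row[c] for c in columns} for row in table]


DOC = """
<museum>
  <painting><title>Primavera</title><artist>Botticelli</artist></painting>
  <painting><title>Untitled</title></painting>
</museum>
"""

root = ET.fromstring(DOC)
tab = bind(root, "painting", {"t": "title", "a": "artist"})
titles = project(select(tab, lambda r: r["a"] == "Botticelli"), ["t"])
print(tab, titles)
```

Only the first painting matches the filter, because the second one has no artist child to bind.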
Query Planning Methods for Mediators of Heterogeneous Information Sources
Outline :
• Query Planning for LAV approach
• Query Containment Techniques
• Wrapper generation
Representation of Information Sources
Formally, the contents of an information source are described by a pair (or set of pairs) of the form $(v, r_v)$, where $v$ is a class name with $m_v$ state attributes and $r_v$ is a formula of the form:
$r_v = \exists U \; p_1(U_1) \land \dots \land p_n(U_n)$
The formula $r_v$ has $m_v$ distinguished variables. The $p_i$'s are any of the classes on the federated level. The class name $v$ is a new name describing an information source. This means that the source can be asked a query of the form $v(Z)$ (or any partial instantiation of it), and returns instances with $m_v$ state attributes that satisfy the following implication:
$\forall Z \, (v(Z) \Rightarrow r_v(Z))$
Simplified source capability model (input bindings, output, selections):
$R_1(Y_1, \dots, Y_k) \;\text{:-}\; R(X_1, \dots, X_m),\ \alpha_1 = a_1, \dots, \alpha_n = a_n,\ \beta_1 = Y_1, \dots, \beta_k = Y_k,\ \sigma_1, \dots, \sigma_h$
Sound and Relevant Query Plans
A simplified query Q to the mediator can be represented as a conjunction:
$Q(Y) : \exists X \; p_1(X_1) \land \dots \land p_n(X_n)$
$X, X_1, \dots, X_n$ are tuples of variables or constants and the $p_i$'s are any of the classes on the federated level. The answer to the query is the set of bindings that can be obtained for the variables in $Y$.
Given a query of the form above, the query processor generates a set of conjunctive plans for answering $Q(Y)$ as formulae of the form:
$Q(Y) : \exists U \; v_1(U_1) \land \dots \land v_k(U_k) \land C_p$
where each of the $v_i$'s is a class name associated with an information source, and $C_p$ is a conjunction of atoms of order relations. Note that the distinguished variables in the plan are the same as the ones in the query. Given a conjunctive plan $P$, the descriptions of the information sources imply that the following constraints hold on the answers it produces (recall that $r_{v_i}$ is the formula describing the constraints on the instances found in $v_i$):
$\mathit{Con}_P : r_{v_1}(U_1) \land \dots \land r_{v_k}(U_k) \land C_p$
Definition: A conjunctive plan P is sound if all the answers it produces are guaranteed to be answers to the query, i.e., if the following entailment holds:
$\forall Y \, (\mathit{Con}_P \Rightarrow \exists X \; p_1(X_1) \land \dots \land p_n(X_n))$
Several conjunctive plans to answer a query are required because the information sources are not complete.
Definition: A conjunctive plan $P$ is relevant to a query $Q(Y) : \exists X \; p_1(X_1) \land \dots \land p_n(X_n)$ if the sentence $\exists Y, X \, (\mathit{Con}_P \land p_1(X_1) \land \dots \land p_n(X_n))$ is satisfiable.
Plan Generation
First step: separately for each subgoal in the query, determine which information sources are relevant to it and collect those sources into the corresponding bucket. An information source is relevant to a subgoal g if the description of the source contains a subgoal g1 that can be unified with g such that, after the unification, the constraints in the query and the constraints in the source description are mutually satisfiable.
‘Satisfiable’ means that the conjunction of built-in atoms should be satisfiable and that there are no two subgoals C(x) and D(x) where C and D are disjoint classes. ‘Mutually satisfiable’ means that if C(Q) and C(U) are the conjunctions of constraint subgoals in the query and in the source, then C(Q) & C(U) should be satisfiable.
Second step: candidate conjunctive plans are constructed by choosing one relevant source for every subgoal in the query, and each plan is checked for soundness and relevance. Specifically, every conjunctive plan Q1 of the following form is considered:
$Q_1(Y) : (\exists U) \; v_1(U_1) \land \dots \land v_n(U_n)$
where $v_i(U_i)$ has been deemed relevant to subgoal $p_i$ in the query. Each such conjunctive plan is checked to be (1) relevant, (2) sound (if it is not sound, it is checked whether it can be made sound by adding conjuncts of order predicates), and (3) minimal (i.e., no subgoal can be removed from the plan while still obtaining a sound plan).
These properties are usually checked using algorithms for containment of conjunctive queries. The algorithm should be guaranteed to produce only sound and relevant plans.
Does the algorithm produce all the necessary conjunctive plans? The answer is based on the close relationship between the problem of finding conjunctive plans and the problem of answering queries using materialized views.
Although the cost of checking minimality and soundness of a conjunctive plan is exponential, it is exponential only in the size of the query, which tends to be small, and not in the number of information sources or their contents.
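The two steps of plan generation can be sketched in Python under strong simplifying assumptions: subgoals are reduced to federated class names, constraints to attribute = constant pairs, and unification to a name match. The source descriptions below are invented for the example, and the soundness, relevance and minimality checks, which require a containment test, are deliberately left out.

```python
from itertools import product


def mutually_satisfiable(query_constraints, source_constraints):
    """Equality constraints clash only if the same attribute is bound
    to two different constants."""
    return all(source_constraints.get(attr, value) == value
               for attr, value in query_constraints.items())


def build_buckets(query_subgoals, query_constraints, sources):
    """Step 1: for each subgoal, collect the sources relevant to it."""
    buckets = []
    for goal in query_subgoals:
        bucket = [name
                  for name, (classes, constraints) in sources.items()
                  if goal in classes
                  and mutually_satisfiable(query_constraints, constraints)]
        buckets.append(bucket)
    return buckets


def candidate_plans(buckets):
    """Step 2 (first half): choose one relevant source per subgoal.
    Each candidate would still have to be checked for soundness."""
    return list(product(*buckets))


# Hypothetical source descriptions: (federated classes mentioned, constraints).
SOURCES = {
    "cimi": ({"creator", "heritage_entity"}, {}),
    "louvre": ({"creator", "repository"}, {"repository_name": "Louvre"}),
    "uffizi": ({"creator", "heritage_entity", "repository"},
               {"repository_name": "Uffizi"}),
    "french_only": ({"creator"}, {"nationality": "France"}),
}

buckets = build_buckets(["creator", "heritage_entity"],
                        {"nationality": "Italy"}, SOURCES)
plans = candidate_plans(buckets)
print(buckets)
print(len(plans))
```

The number of candidates is the product of the bucket sizes, which is why the later checks dominate the cost for queries with many subgoals.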
Query Containment Algorithms
• Basic techniques (e.g., QinP (Ullman): Containment of conjunctive queries in logical recursions, negation in conjunctive queries by Chan)
• Extensions:
1. Containment for queries with complex objects. Typing constraints and integrity constraints for object DB schemas
2. Relative containment
3. Conjunctive queries with regular expressions; query containment under constraints
4. Bag containment of conjunctive queries
• Alternative techniques
1. Counter machines to study query containment
2. Verification of knowledge bases
3. Description Logics
Containment of Conjunctive Queries in Logical Recursions (QinP)
An algorithm testing whether a conjunctive query is contained in the relation defined by a logic program.
Given are a conjunctive query Q, represented as
H :- G1 & … & Gk
and a logic program P.
To decide whether Q ⊆ P:
1) Assign a unique constant to every variable in Q.
2) Form an EDB relation from the subgoals of Q.
3) Evaluate P bottom-up over this EDB, obtaining the DB relation.
4) If the head H of Q, with its variables replaced by the constants from step 1, is in DB, then Q ⊆ P.
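A compact Python sketch of this kind of containment test follows. It uses the usual canonical-database formulation: the variables of Q are frozen into constants, the program is evaluated naively bottom-up over the frozen subgoals, and Q is taken to be contained in P when the frozen head of Q is derived. The representation (atoms as tuples, variables as capitalised strings) and all function names are assumptions made for the example; rules are assumed to be safe and negation-free.

```python
def is_var(term):
    """Convention for this sketch: variables start with an upper-case letter."""
    return isinstance(term, str) and term[:1].isupper()


def freeze(query):
    """Steps 1-2: replace each variable by a unique constant."""
    head, body = query
    theta = {}

    def frozen(atom):
        return (atom[0],) + tuple(
            theta.setdefault(t, "c_" + t) if is_var(t) else t
            for t in atom[1:])

    return frozen(head), {frozen(a) for a in body}


def match(atom, fact, subst):
    """Extend subst so that atom equals fact, or return None."""
    if atom[0] != fact[0] or len(atom) != len(fact):
        return None
    subst = dict(subst)
    for term, value in zip(atom[1:], fact[1:]):
        if is_var(term):
            if subst.setdefault(term, value) != value:
                return None
        elif term != value:
            return None
    return subst


def evaluate(program, edb):
    """Step 3: naive bottom-up evaluation to a fixpoint."""
    facts = set(edb)
    changed = True
    while changed:
        changed = False
        for head, body in program:
            substs = [{}]
            for atom in body:
                substs = [s2 for s in substs for fact in facts
                          for s2 in [match(atom, fact, s)] if s2 is not None]
            for s in substs:
                derived = (head[0],) + tuple(s.get(t, t) for t in head[1:])
                if derived not in facts:
                    facts.add(derived)
                    changed = True
    return facts


def contained_in(query, program):
    """Step 4: Q is contained in P if the frozen head is derived."""
    frozen_head, canonical_edb = freeze(query)
    return frozen_head in evaluate(program, canonical_edb)


# P: transitive closure of edge.
P = [
    (("path", "X", "Y"), [("edge", "X", "Y")]),
    (("path", "X", "Z"), [("edge", "X", "Y"), ("path", "Y", "Z")]),
]
two_hops = (("path", "A", "C"), [("edge", "A", "B"), ("edge", "B", "C")])
broken = (("path", "A", "C"), [("edge", "A", "B"), ("edge", "D", "C")])

print(contained_in(two_hops, P), contained_in(broken, P))
```

The second query is rejected because its two edges do not share a node, so no path from A to C can be derived from the frozen facts.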
A Query Converter for Wrappers Toolkit
In TSIMMIS the query converter is part of the wrapper implementation toolkit. MSL, a logic-based, OEM-oriented query language, is used.
Source capabilities are defined with templates in a Query Description and Translation Language (QDTL). Each template can be associated with an action that generates the commands for the underlying source.
The converter will process:
• Directly supported queries. These are queries that syntactically match a template.
• Logically supported queries. These are queries that produce the same results as a directly supported query. The notion of logical equivalence is used to detect queries that fall in this class.
• Indirectly supported queries. These are queries that can be executed in two steps: first a directly supported query is executed, and then a filter is applied to the results of the first step.
Detection of maximal supporting query and of a filter
A query qs is a maximal supporting query of query q with respect to capability description d if qs is directly supported by d, qs indirectly supports q, and there is no directly supported query q’s that indirectly supports q, is subsumed by qs, and is not logically equivalent to qs. There may be more than one maximal supporting query for a given query.
Capability description D is expressed as a (possibly recursive) Datalog program. The problem of determining whether a description D supports query Q is the same as the problem of determining whether program P(D) contains (subsumes) query Q and whether a corresponding filter query exists. A supporting query is found in two steps:
1. find a subsuming query, and
2. find the corresponding filter.
The approach is based on the extended Ullman query containment algorithm (X-QinP), which gives a yes/no answer to the containment question.
The algorithm is extended to find the actual maximal supporting queries and also the native query constituents for the underlying source.
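The split of a query into a supporting query and a filter can be illustrated with a deliberately simplified capability model. In this sketch a query is a set of attribute = value conditions and a template is the set of attributes a source accepts as conditions; this is an assumption made for the example and is far weaker than QDTL templates or the Datalog-based test described above. The function and variable names are hypothetical.

```python
def plan(query, templates):
    """Return (native_query, filter_conditions), or None if unsupported.
    Directly supported: the query's attributes match a template exactly.
    Indirectly supported: the largest usable template is sent to the source
    and the remaining conditions become a filter."""
    attrs = frozenset(query)
    if attrs in templates:
        return dict(query), {}
    usable = [t for t in templates if t <= attrs]
    if not usable:
        return None
    best = max(usable, key=len)
    native = {a: query[a] for a in best}
    residual = {a: v for a, v in query.items() if a not in best}
    return native, residual


def execute(query, templates, source):
    """Run the native part at the source, then apply the filter locally."""
    planned = plan(query, templates)
    if planned is None:
        raise ValueError("query is not supported by the source")
    native, residual = planned
    rows = source(native)
    return [r for r in rows
            if all(r.get(a) == v for a, v in residual.items())]


# Hypothetical source: can be queried by author, or by author and year.
TEMPLATES = {frozenset({"author"}), frozenset({"author", "year"})}
ROWS = [
    {"author": "Ullman", "year": 1988, "title": "Principles"},
    {"author": "Ullman", "year": 1988, "title": "Other"},
    {"author": "Ullman", "year": 1997, "title": "Principles"},
]


def source(native):
    return [r for r in ROWS if all(r[a] == v for a, v in native.items())]


q = {"author": "Ullman", "year": 1988, "title": "Principles"}
print(plan(q, TEMPLATES))
print(execute(q, TEMPLATES, source))
```

Choosing the largest usable template pushes as many conditions as possible to the source, which loosely mirrors the preference for a maximal supporting query.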
Known modifications of query rewriting algorithms using views
1. Conjunctive queries
2. Source templates with binding patterns
3. Recursive queries
4. Views in description logic
5. Rewriting for semistructured data. Regular expressions rewriting, navigational plans
6. Boolean queries rewriting
7. Queries with union and aggregation
8. Type inferencing
9. Object fusion
10. Scalable technique
Infrastructure of the mediator aiming at semantic interoperability of collections
Outline :
• Heterogeneity of the mediator
• Canonical information model
• Mediator’s metadata
• Information extraction framework
• Collection registration at a mediator as a process of compositional development
Heterogeneous information models absorbed by the canonical model
[Diagram: the Canonical Model consists of a Core and Extensions; the models below are linked to it by the relation is_refined_by]
• Component Models (IDL, CDL, BOF)
• Object & Heterogeneous DB Models (ODL, SQL3, Garlic)
• Knowledge Base Representations (OKBC, Ontolingua)
• Unstructured Data (vocabularies, thesauri)
• Semistructured Data Models (OEM, ADM, OQL-doc)
• Document Object Model
• Metadata for DL (Dublin Core, Warwick, Starts, Z39.50)
• Metadata Expressible in Meta Models (MOF, RDF)
• Workflow Models
Canonical Model Entities
[Diagram: entities of the canonical model are Metaclass, Class, Type, Object, Frame, Collection, World and Abstract Value. The edges are labelled instance_of, superclass, supertype, type and instance type; one edge is labelled "becomes an object".]
Canonical Information Model
A set of the canonical model facilities used for the uniform representation of the information resources includes the following:
• Frame representation facilities. Frames are treated as a special kind of abstract values introduced mostly for description of concepts, terminological and weakly-structured information. All specifications in canonical model have a form of frames that become a part of the metabase.
• Unifying type system. A universal constructor of arbitrary abstract data types as well as a comprehensive collection of the built-in types are included into a type system.
• Class representation. Classes represent sets of homogeneous entities of an application domain. Class instances (objects) have specific types.
• Multiactivity (workflow) representation. These are used for the specification and implementation of interconnected and interdependent application activities, and for the specification of declarative assertions and concurrent megaprograms over the information resources.
• Facilities for the logical formulae expressions. A multisorted object calculus (typed first-order language) is used for querying the integrated set of digital collections as well as for specification of constraints and behaviour.
Mediator’s Metadata Layering
[Diagram: metadata layers of the mediator, from top to bottom]
• Personalized DL Level: Views, Subschemas, Specific Vocabulary
• Interoperable (Federated) Level: Federated Schema (Core, Extension), Common Thesauri, Subject Classification Hierarchy & Context (metaclass hierarchy & ontological definitions)
• Local Level: Schema, Vocabulary, Thesauri and Ontology for the collections
• Real Collection Level: Structured Collection, Semistructured Collection, Unstructured Collection
Information Extraction Framework
[Diagram: components of the information extraction framework]
• Local Collections: XML data system, Z39.50 server, information retrieval system, molecular biology data banks
• Wrappers: XML wrapper, Z39.50 wrapper, information retrieval system wrapper, SRS wrapper
• Protocols shown: http, Z39.50, IIOP, http
• Mediator’s DBMS (object-relational DBMS) with metadata repository and data
• Query Engine: canonical mediator’s query language; best relevant collection identification; query decomposition; query planning and monitoring
• Information Extraction Facilities and Localization Facilities
• Result processing: ranking, merging, aggregation, summarization
• User side (Java / CORBA): Graphical Query Facilities, Outcome Presentation, Canonical GUI, Personalized DL, Personalization Facilities
Metainformation Repository
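[Diagram: classes of the metainformation repository are Value, Frame, Slot, Attribute, Module, Type, Class, Schema, ADT, CEntity, Function, Concept, Reduct, CompType, View, Category, Metaclass and Simulating. The associations shown are slots, instances, instType, instInstType and simulatings, with 1-to-* multiplicities.]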
Collection Registration Framework
The framework facilities are intended to support functions of collection contextualizing:
• constructing mapping of a collection data model and metadata into the canonical ones;
• representation of the new metainformation in terms of the federated mediator's level;
• inferring from the collection the required information for the federated level;
• semi-automatic construction of a collection wrapper;
• connecting the wrapper to the interoperation environment (e.g., CORBA).
Contextualization of Ontology
• mapping of the local ontological context to that of the mediator
– by names and relationships
– by natural language description
– applying structural integration to concept specifications
– introducing new concepts over existing ones
• contextualization through structural correlation
– establishing weak ontological relevance of specification elements by applying analysis of intercontext concept relationships
– establishing tight ontological relevance of specification elements by introducing a subsumption relationship between concepts
Correlation of Ontological Concepts
• evaluation of descriptor weights
• establishing intercontext relationships between concepts
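Descriptor weights:

\[ W_{Xk} = \frac{f_k \log\frac{N}{n_k}}{\sqrt{\sum_{i \in V_X} \left( f_i \log\frac{N}{n_i} \right)^2}} \]

Intercontext relationships between concepts X and Y:

\[ sim(X,Y) = \frac{\sum_{k \in V_X \cap V_Y}^{t} W_{Xk} W_{Yk}}{\sqrt{\sum_{k \in V_X}^{t} W_{Xk}^2 \cdot \sum_{k \in V_Y}^{t} W_{Yk}^2}} \]

\[ r(X,Y) = \frac{\sum_{k \in V_X \cap V_Y}^{t} \min(W_{Xk}, W_{Yk})}{\sum_{k \in V_X}^{t} W_{Xk}^2} \]

\[ r(Y,X) = \frac{\sum_{k \in V_X \cap V_Y}^{t} \min(W_{Xk}, W_{Yk})}{\sum_{k \in V_Y}^{t} W_{Yk}^2} \]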
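A minimal Python sketch of how such correlation measures could be computed. It assumes a TF-IDF-style reading of the weights (f is the frequency of a descriptor in a concept description, n the number of concepts using that descriptor, N the total number of concepts), a cosine measure for sim, and an asymmetric min-based measure for r whose denominator is the sum of squared weights. These readings, the function names and the sample data are all assumptions made for illustration, not definitions taken from the slides.

```python
import math

def descriptor_weights(freq, n_with, total):
    """Normalized weights: f_k * log(N / n_k), divided by the vector length."""
    raw = {k: f * math.log(total / n_with[k]) for k, f in freq.items()}
    norm = math.sqrt(sum(v * v for v in raw.values()))
    return {k: (v / norm if norm else 0.0) for k, v in raw.items()}

def sim(wx, wy):
    """Symmetric (cosine-style) similarity over the shared descriptors."""
    shared = wx.keys() & wy.keys()
    num = sum(wx[k] * wy[k] for k in shared)
    den = math.sqrt(sum(v * v for v in wx.values()) *
                    sum(v * v for v in wy.values()))
    return num / den if den else 0.0

def r(wx, wy):
    """Asymmetric relevance of X with respect to Y (assumed form)."""
    shared = wx.keys() & wy.keys()
    num = sum(min(wx[k], wy[k]) for k in shared)
    den = sum(v * v for v in wx.values())
    return num / den if den else 0.0

# Hypothetical descriptor statistics for three concepts.
n_with = {"painting": 10, "canvas": 5, "sculpture": 4}
x = descriptor_weights({"painting": 3, "canvas": 1}, n_with, 100)
y = descriptor_weights({"painting": 2, "sculpture": 2}, n_with, 100)
z = descriptor_weights({"sculpture": 1}, n_with, 100)
```

Concepts that share no descriptors (x and z here) get sim and r equal to zero, while r(x, y) and r(y, x) generally differ, which is what makes r usable for directed relationships between concepts.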
Ontological Metainformation
[Figure: class diagram of the ontological metainformation.]
• Classes shown: Class, ADT, Category (code: string), Concept (definition: string, wordClass: string), Descriptor (weight: float, name: string), ConceptRel (strength: float = 1), ConceptWeight (weight: float, frequency: float, name: string)
• Kinds of concept relationships: PositiveRel, NarrowRel, PartRel, RelativeRel
• Associations shown: type, descriptors / descriptorOf, fromRelation / toConcept, toRelation / fromConcept, weights / weightOf, foreign, collection, concept, category
Process of an Information Source Registration
For each source class, the following steps of the compositional development process are required [LNCS 2151]:
1. Identification of relevant federated classes
• Find the federated classes that can be used, ontologically, to define the extent of the source class in terms of federated classes. Several federated classes may correspond to one source class, their instance types covering different reducts of the instance type of the source class. Conversely, several source classes may correspond to one federated class.
2. Construction of most common reducts
For the instance type of each identified federated class:
• Construct the most common reducts of this federated instance type and of the source class instance type, so as to concretize (partially) the federated instance type. A most common reduct may also include additional attributes corresponding to those federated type attributes that can be derived from the source type instances.
• For each attribute type of the common reduct, construct a concretizing type, a concretizing function, or a combination of them (this step is applied recursively).
3. Partial source view construction
• For each relevant federated class, construct a partial source view expressing, in terms of the federated class, the constraints that must be satisfied by the values of the respective most common reducts of source class instances. This yields partial views over all relevant federated classes.
4. Composition of partial views
• Construct the composition of the source type most common reducts obtained for the instance types of all federated classes involved.
• Construct the source view as a composition of the partial views obtained above. This is an expression of a materialized view of the information source in terms of federated classes. The instance type of this view is determined by the composition of most common reducts constructed above.
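Steps 1 and 2 can be sketched in Python under strong simplifications. The sketch assumes that attribute correspondences between a federated type and the source type are already known (in the process above they result from ontological analysis), and it treats a most common reduct as the set of source attributes that concretize some federated attribute. The schema fragments and the creator correspondence are hypothetical, loosely following the Uffizi example.

```python
def most_common_reduct(source_attrs, federated_attrs, correspondences):
    """Return (source attributes taken, federated attribute -> source attribute).

    correspondences maps a federated attribute to the source attribute that
    concretizes it, directly or through a concretizing function.
    """
    covered = {fed: src for fed, src in correspondences.items()
               if fed in federated_attrs and src in source_attrs}
    taking = sorted(set(covered.values()))
    return taking, covered

def relevant_federated_classes(source_attrs, federated_schema, correspondences_by_class):
    """Step 1 (simplified): a federated class is relevant if its reduct is non-empty."""
    relevant = []
    for cls, fed_attrs in federated_schema.items():
        _, covered = most_common_reduct(
            source_attrs, fed_attrs, correspondences_by_class.get(cls, {}))
        if covered:
            relevant.append(cls)
    return relevant

# Hypothetical schema fragments.
canvas_attrs = {"title", "painter", "date", "history", "description", "to_image"}
federated_schema = {
    "painting": {"title", "created_by", "date_of_origin", "dimensions"},
    "creator": {"name", "nationality", "general_info", "works"},
    "antiquities": {"type_spicemen", "archeology"},
}
correspondences_by_class = {
    "painting": {"title": "title", "created_by": "painter", "date_of_origin": "date"},
    "creator": {"name": "painter"},  # assumed correspondence
}

relevant = relevant_federated_classes(canvas_attrs, federated_schema, correspondences_by_class)
taking, covered = most_common_reduct(
    canvas_attrs, federated_schema["painting"], correspondences_by_class["painting"])
```

Here one source class (canvas) turns out to be relevant to two federated classes (painting and creator), each covering a different reduct of the source instance type, which is the situation described in step 1.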
Subject Mediator. Cultural Heritage Collections. Collections Registration
[Figure: mapping of local metainformation into federated level metainformation.]
• Sources: CIMI Profile of Z39.50, Louvre Museum Web Site, Uffizi Museum Web Site
• Local classes and attributes shown: museum_object (created_by, date_collected, description, object_id, relation, …), creator_c (nationality, works), department (name, description, sections), author (name, nationality, works), artist (name, biography, paint_list), canvas (title, painter, date, history, description, to_image)
• Other labels shown: content_general, collection, mrObject
Local Views in Terms of Federated Classes
creator_c(c/Creator_Creator_Info[name, nationality, date_of_birth, date_of_death, works/{set_of: Heritage_Entity_Museum_Object}]) ⊆ creator(c[name, nationality, date_of_birth, date_of_death, works])
author(a/Creator_Author[name/fname, nationality, works/{set_of: Heritage_Entity_Work}]) ⊆ creator(name, nationality, works(w)) & ∃ c,s (repository(c/Collection[contains(s/Section)]) & repository.name = 'Louvre' & in(w, s.contains))
artist(a/Creator_Artist[name, nationality, general_info/Text_Textual, works/{set_of: Painting_Canvas}]) ⊆ creator(a[name, nationality, general_info, works]) & repository(n/name, collection) & n = 'Uffizi' & ∃ col/Collection (isempty(intersect(collection(col/Collection).contains, works)))
Specifications of Types of the Uffizi Site Schema
[Figure: class diagram of the Uffizi site schema.]
• Repository (name: string)
• Artist (name: string, biography: Textual)
• Canvas (title: Textual, painter: string, culture: Textual, date: time, description: Textual)
• Room (room_no: string, room_name: Textual)
• Image
• Associations shown: authors {ordered}, contains {ordered}, paint_list {ordered}, room_list, to_image (1 to 1)
Specifications of Types of the Federated Schema
[Figure: class diagram of the federated schema.]
• Person (name: string, nationality: string, date_of_birth: time, date_of_death: time, residence: Address)
• Creator (culture_race: string, general_info: Text)
• Entity (title: Text, date: time, narrative: Text)
• Heritage_Entity (place_of_origin: Address, date_of_origin: time, content: Text)
• Painting (dimensions: {sequence; type_of_element: integer})
• Antiquities (type_spicemen: Text, archeology: Text)
• Repository (name: string, place: Address, description: Text)
• Collection (name: Text, location: Address, description: Text)
• Digital_Entity
• Associations shown: created_by, works, collections / in_repository, in_collection / contains, digital_form
Most Common Reduct (Example)
{CR_Painting_Canvas;
  in: c_reduct;
  metaslot
    of: Canvas;
    taking: {title, painter, date, description, to_image};
    reduct: R_Painting_Canvas
  end;
  simulating: {
    R_Painting_Canvas.title ~ CR_Painting_Canvas.title;
    R_Painting_Canvas.created_by ~ CR_Painting_Canvas.get_created_by;
    R_Painting_Canvas.date_of_origin ~ CR_Painting_Canvas.date;
    ... };
  get_created_by: {in: function;
    params: {+ext/CR_Painting_Canvas, -returns/Creator};
    predicative: {ex c/Canvas ((c/CR_Painting_Canvas = ext) &
      ex a/Artist ((c.painter = a.name) &
        returns = a/CR_Creator_Artist))}}
  ...
}
Partial Source View Construction (Example)
The formula expressing the local class canvas in terms of the federated class painting is defined as:
canvas(p/CR_Painting_Canvas) ⊆ painting(p/R_Painting_Canvas) & p.in_collection.in_repository = 'Uffizi'
The specification of a class containing this formula (actually, a local-as-view class) is:
{v_canvas_painting;
  in: class;
  class_section: {
    key: invariant, {unique; {title}};
    lav: invariant, {subseteq(v_canvas_painting(p),
      painting(p/R_Painting_Canvas) &
      p.in_collection.in_repository = 'Uffizi')}
  };
  instance_section: CR_Painting_Canvas
}
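The meaning of the lav invariant can be illustrated with a small Python check over sample data. This is only a sketch: in a local-as-view setting the federated class painting is virtual, so the painting extent below is a hypothetical materialized sample, instances are plain dictionaries, and canvases are matched to paintings by the key attribute title.

```python
def lav_violations(canvases, paintings, repository="Uffizi"):
    """Titles of source canvases that break the subseteq invariant, i.e. have
    no counterpart among the paintings held in the given repository."""
    target = {p["title"] for p in paintings
              if p["in_collection"]["in_repository"] == repository}
    return [c["title"] for c in canvases if c["title"] not in target]

# Hypothetical sample data.
paintings = [
    {"title": "Primavera", "in_collection": {"in_repository": "Uffizi"}},
    {"title": "Mona Lisa", "in_collection": {"in_repository": "Louvre"}},
]
ok_canvases = [{"title": "Primavera", "painter": "Botticelli"}]
bad_canvases = ok_canvases + [{"title": "Mona Lisa", "painter": "Leonardo"}]
```

An empty result means that every source instance is contained in the federated expression; the view states containment only, so paintings in the Uffizi that the source does not list are not violations.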
Source View Composition (Example)
The final formula for the local class canvas in terms of the federated classes painting and creator is:
canvas(p/CR_Painting_Creator_Canvas) ⊆
  painting(p/R_Painting_Canvas) & p.in_collection.in_repository = 'Uffizi' &
  creator(c/R_Creator_Canvas) & ∃ w/Painting (in(w, c.works) &
  w.in_collection.in_repository = 'Uffizi')
The complete definition of the source view is as follows:
{v_canvas;
  in: class;
  class_section: {
    key: invariant, {unique; {title}};
    lav: invariant, {subseteq(v_canvas,
      painting(p/R_Painting_Canvas) &
      p.in_collection.in_repository = 'Uffizi' &
      creator(c/R_Creator_Canvas) & ex w/Painting
      (in(w, c.works) & w.in_collection.in_repository = 'Uffizi'))}
  };
  instance_section: CR_Painting_Creator_Canvas
}
CR_Painting_Creator_Canvas = CR_Painting_Canvas ⌴ CR_Creator_Canvas
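The composition step can be sketched in Python as follows. The sketch assumes that partial views can be modelled as predicates combined by conjunction, and that the composition of most common reducts can be approximated by the union of their attribute sets; the actual type composition operation is not defined on the slides, so this is an illustration only. The attribute list of CR_Creator_Canvas and the sample instance are hypothetical.

```python
def compose_reducts(*reducts):
    """Assumed model of reduct composition: union of attribute sets."""
    return sorted(set().union(*reducts))

def compose_views(*partial_views):
    """A source view holds for an instance when every partial view holds."""
    return lambda inst: all(view(inst) for view in partial_views)

# Partial view over painting: the canvas is held in the Uffizi.
def painting_view(inst):
    return inst["in_collection"]["in_repository"] == "Uffizi"

# Partial view over creator: some work of the creator is held in the Uffizi.
def creator_view(inst):
    return any(w["in_collection"]["in_repository"] == "Uffizi"
               for w in inst["creator"]["works"])

cr_painting_canvas = {"title", "painter", "date", "description", "to_image"}
cr_creator_canvas = {"painter"}  # hypothetical attribute list
instance_type = compose_reducts(cr_painting_canvas, cr_creator_canvas)
v_canvas = compose_views(painting_view, creator_view)

# Hypothetical instance in the shape expected by both partial views.
work = {"in_collection": {"in_repository": "Uffizi"}}
sample = {"title": "Primavera",
          "in_collection": {"in_repository": "Uffizi"},
          "creator": {"name": "Botticelli", "works": [work]}}
```

Keeping each partial view as a separate predicate mirrors the registration process: a partial view is constructed per federated class, and only the final step combines them.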
Structure of the Collection Registration Tool
[Figure: structure of the collection registration tool. Labels shown in the diagram:]
• Collection Registration Tool functions: source collection context / mediator metadata reconciliation; construction of source class specifications as views over federated classes; most common reduct identification; wrapper generation
• Mediator's DBMS (Oracle 8i) with metainformation repository
• B-Toolkit, B-AMN
• wrapper code
Summary
• Subject domain mediation has good prospects for the integration of heterogeneous information sources as professional communities form around the Internet
• The 'local as view' (LAV) approach looks promising for worlds of multiple, dynamically changing sources (in content and availability), and it also provides for the mediator's scalability
• Widely known mediator projects and related research have contributed a lot to mediator definition, query planning, source capability description and wrapper generation
• Many serious gaps remain, e.g.: mostly relational models were studied; conjunctive queries were supported; thesauri and ontologies have not been sufficiently involved; query containment was studied for precise queries (querying of textual, multimedia, object and semistructured data may require reconsideration); the problem of source view registration for the LAV approach has not been studied; mediator composition problems have not been investigated
• Therefore, the area looks fruitful for research, experimentation and development.