TOPIC
By
XYZ
Supervisor
Dr
A thesis submitted in partial fulfillment of
The requirements for the degree of
Masters in Computer Science
In
Department of Computer Science
Pakistan
(July 2018)
APPROVAL
It is certified that the contents and form of the thesis entitled “” submitted have been found satisfactory for the requirements of the degree.
Advisor: __________________
Committee Member: _________________
Committee Member: _________________
Committee Member: _________________
IN THE NAME OF ALMIGHTY ALLAH
THE MOST BENEFICENT AND THE MOST MERCIFUL
TO MY PARENTS,
BROTHER AND SISTERS
CERTIFICATE OF ORIGINALITY
I hereby declare that this submission is my own work and to the best of my knowledge it
contains no materials previously published or written by another person, nor material which to a
substantial extent has been accepted for the award of any degree or diploma at BZU or at any
other educational institute, except where due acknowledgement has been made in the thesis. Any
contribution made to the research by others, with whom I have worked at BZU or elsewhere, is
explicitly acknowledged in the thesis.
I also declare that the intellectual content of this thesis is the product of my own work, except for
the assistance from others in the project’s design and conception or in style, presentation and
linguistics which has been acknowledged.
Author Name:
Signature: ______________
ACKNOWLEDGEMENTS
First of all, I am extremely thankful to Almighty Allah for giving me the courage and strength to complete this challenging task and to compete with the international research community. I am also grateful to my family, especially my parents, who have supported and encouraged me through their prayers, which have always been with me.
I am highly thankful to my supervisor for his valuable suggestions and continuous guidance throughout my research work. His foresight and critical analysis taught me a great deal about valuable research, which will help me in my practical life.
I would like to offer my gratitude to all the members of the research group and my close colleagues who have encouraged me throughout my research work, especially Mr Maruf Pasha.
TABLE OF CONTENTS
List of Figures
List of Tables
List of Abbreviations
ABSTRACT
CHAPTER 1
INTRODUCTION
1.1. Motivation
1.2. Problem Definition
1.3. Objective and Goals of Research
1.4. Outlines of Thesis
CHAPTER 2
BACKGROUND STUDIES
2.1. Data Integration
2.2. Issues in Data Integration
2.3. Approaches to Data Integration
2.4. Query Processing in Data Integration
2.5. Ontology
2.6. Indexing
CHAPTER 3
LITERATURE SURVEY
3.1. Query Reformulation
3.2. State of the Art Techniques
CHAPTER 4
PROPOSED ARCHITECTURE
4.1. Proposed Architecture for the Relevance Reasoning
4.2. Semantic Matching & Source Ranking of RDF Triples
4.3. Proposed Semantic Matching Methodology
4.4. Explanation of Proposed Methodology using a Case Study
CHAPTER 5
IMPLEMENTATION
5.1. RDF Data / Ontologies in Oracle Database
5.2. Setting up the Stage for Implementation
5.3. Implementation of the Proposed Architecture for Relevance Reasoning
CHAPTER 6
RESULTS AND EVALUATION
6.1. System Specification
6.2. Evaluation Criteria
6.3. Data Specification
6.4. Test Queries
6.5. Experiments for Response Time of Query Execution
6.6. Experiments for System Accuracy
CHAPTER 7
CONCLUSION AND FUTURE DIRECTIONS
7.1. Discussion
7.2. Main Contribution of the Project
7.3. Future Direction
REFERENCES
LIST OF FIGURES
Figure 1: Data Warehousing Architecture for Data Integration
Figure 2: Mediator Wrapper Architecture for Data Integration
Figure 3: RDF Triple as Directed Graph
Figure 4: Structure of a Bitmap Index
Figure 5: Proposed Architecture for Relevance Reasoning in Data Integration Systems
Figure 6: Sequence Diagram for Ontology Management Workflow
Figure 7: Pseudo-code for RDF Triple Registration of Global Ontology
Figure 8: InverseOf SameAs Rule Inserted in the Rule-base
Figure 9: TransitiveOf SameAs Rule Inserted in the Rule-base
Figure 10: Pseudo-code for RDF Triple Creation of Local Ontology
Figure 11: Pseudo-code for Bitmap Segment Creation
Figure 12: Pseudo-code for Bitmap Synchronization
Figure 13: Sequence Diagram for Source Registration Workflow
Figure 14: Sequence Diagram for Relevance Reasoning Workflow
Figure 15: Pseudo-code for Query Expansion in Relevance Reasoning Workflow
Figure 16: Pseudo-code for Source Selection in Relevance Reasoning Workflow
Figure 17: Snapshot of the Global Ontology
Figure 18: Concept & Relationship Hierarchies Managed using Semantic Operators over Global Ontology
Figure 19: Database Schema to Store Ontology in Oracle NDM
Figure 20: Package Diagram of the Proposed Architecture for Relevance Reasoning
Figure 21: Time Complexity of System (Query with 3 Triples)
Figure 22: Time Complexity of System (Query with 6 Triples)
Figure 23: Time Complexity of System (Query with 9 Triples)
Figure 24: Performance Gain of the System with respect to Direct Ontology Traversal
Figure 25: Precision vs Recall Comparison of the Proposed Methodology with the MiniCon Algorithm
LIST OF TABLES
Table 1: Relevance Levels and Scoring Strategy
Table 2: RDF Triples of the Global Ontology
Table 3: Structure of Bitmap Index
Table 4: RDF Triples of the Data Sources
Table 5: Structure of Bitmap Index after Sources are Registered
Table 6: Buckets Created for the RDF Triples
Table 7: Inferred RDF Triples for a User’s Query Triple
Table 8: Semantic Similarity Calculation of a Data Source for a User Query Triple
Table 9: Semantic Similarity Calculation of a Data Source for a User Query
LIST OF ABBREVIATIONS
XML Extensible Markup Language
WWW World Wide Web
DAML DARPA Agent Markup Language
OWL Web Ontology Language
API Application Programming Interface
DIS Data Integration Systems
NDM Network Data Model
RDF Resource Description Framework
W3C World Wide Web Consortium
URL Uniform Resource Locator
ICT Information and Communication Technologies
AI Artificial Intelligence
UMLS Unified Medical Language System
IM Information Manifold
GUID Global Unique Identifier
LUID Local Unique Identifier
SDS Source Description Storage
ABSTRACT
Online data sources are autonomous, heterogeneous and geographically distributed. Data sources can join and leave a data integration system arbitrarily, and some sources may not contribute significantly to a user query because they are not relevant to it. Executing queries against all the available data sources consumes resources unreasonably and makes these queries expensive.
Source selection is an approach to resolving this issue. The existing relevance reasoning techniques for source selection take significant time to traverse the source descriptions. Consequently, query response time degrades as the number of available sources grows. Moreover, a simple matching process cannot resolve the fine-grained semantic heterogeneities of the data, and the semantic heterogeneity of data sources makes relevance reasoning complex. These issues degrade the performance of data integration systems.
In this research, we have proposed an ontology-driven relevance reasoning architecture that identifies relevant data sources for a user query before its execution. The proposed methodology aligns source descriptions (i.e., local ontologies) with the domain ontology through a bitmap index. Instead of traversing the local ontologies, the methodology uses the bitmap index to perform relevance reasoning, thereby improving query response. Semantic matching has been employed in relevance reasoning to provide semantic interoperability. Semantic operators, such as exactMatch, sameAs, equivalentOf, subClassOf, and disjointFrom, have been introduced to resolve fine-grained semantic heterogeneities among data sources. Quantitative scores are assigned to the operators, and data sources are ranked by the similarity scores they obtain.
A prototype system has been designed and implemented to validate the methodology. The evaluation criteria are (a) query response time and (b) accuracy of relevant source selection. The prototype system has been compared with existing systems for evaluation. Query response time and accuracy of source selection, measured in terms of precision and recall, have been improved by the incorporation of the bitmap index and the ontology, respectively.
CHAPTER 1
INTRODUCTION
This chapter introduces the research work undertaken in this thesis. It includes the motivation for and the definition of the problem. Moreover, the objectives and goals of the research are also discussed.
1.1. Motivation
The exponential growth in data sources on the Internet is due to advancements in information and communication technologies (ICT). Some data sources contain interrelated data that could answer a user query. Retrieving data from these interrelated data sources is a non-trivial task due to their properties, i.e., autonomy, heterogeneity and geographical distribution [1, 8, 11, 23]. The sources can be heterogeneous in terms of syntax, schema, or semantics. The task of a data integration system is to enable the interoperation of autonomous and distributed data sources for knowledge discovery through a centralized access point. It provides a uniform query interface that gives a user transparent access for querying data sources. However, the properties discussed above make integration among the sources a pervasive challenge and a crucial task [1, 8, 23].
A variety of approaches to data integration exist. These approaches can be broadly classified into two major categories: (a) data warehousing and (b) mediation [1, 28]. In data warehousing, the required data is extracted from the sources and stored in a centralized repository after integration, while in mediation, data is gathered and integrated when a user query is submitted. Query execution is efficient and response time is predictable in warehousing, but results can be stale. On the contrary, query execution is slower in mediation, but results are up to date [1, 21, 28].
The growth of online data sources requires a scalable data integration system because the sources are unpredictable due to their autonomy. In other words, data sources can join and leave the system arbitrarily. Thus, checking the availability of a data source before executing a query is necessary. Moreover, not all data sources may have the required information. Executing a query on all data sources is an expensive solution, because an available source may not contribute any significant information to the query result [8, 20, 23]. To execute queries efficiently in these systems, we need to identify relevant and effective data sources that are available at the time of execution. This research work focuses on relevance reasoning for identifying relevant and effective data sources in a scalable data integration system.
1.2. Problem Definition
Identifying relevant sources in a scalable data integration system is difficult due to semantic heterogeneity and lack of performance. We highlight these problems in depth in the following paragraphs.
Semantic Heterogeneity: Data sources are developed by independent organizations, so there may be semantic differences between their schemas [20]. In different data sources, the same concept may be represented with different names, such as instructor, teacher or lecturer. Similarly, different concepts in different data sources may be represented by the same name, such as bank, which can be a river bank or a financial institution.
Performance in Query Response Time: Some data sources may not contribute significantly to a user query because they are not relevant. Executing a query on all available data sources, without any estimate of their relevance to the query, degrades the query’s performance. This leads to unreasonable waste of the resources of the data integration system.
1.3. Objective and Goals of Research
The goal of this research is to provide a mechanism for relevance reasoning in a scalable data integration system. In particular, our objective is to advance relevance reasoning in the following directions.
Provision of Semantic Interoperability in Relevance Reasoning: Ontology, initially developed by the artificial intelligence community for knowledge sharing and reuse, is a formal, explicit specification of a shared conceptualization [5]. Ontology is widely used for representing domain knowledge and can play a vital role in reconciling semantic heterogeneities due to its representational and expressive capabilities [3, 4]. In this research, we exploit the capabilities of the domain ontology to provide semantic interoperability and to handle source heterogeneities during relevance reasoning.
Optimization of the Relevance Reasoning Mechanism: Indexing structures are used in databases to access data efficiently [27, 28]. We propose semantic indexing using a bitmap technique to represent the metadata of data sources. A user query is evaluated against the bitmap index to identify relevant data sources. The index performs relevance reasoning in an improved manner, thereby enhancing query response time.
1.4. Outlines of Thesis
The rest of the document is organized as follows. Chapter 2 describes a data integration system and its various components; RDF is also explained as a language for developing ontologies and for storing source descriptions and semantic mappings. Chapter 3 discusses various algorithms for relevance reasoning and provides their critical analysis. Chapter 4 presents the proposed system architecture and semantic matching process, along with the proposed methodology for relevance reasoning. Chapter 5 gives a complete overview of the implementation details. Chapter 6 presents the experimentation and comparative analysis carried out to validate the proposed architecture, and discusses the conducted experiments. Chapter 7 concludes the thesis and outlines future research directions.
CHAPTER 2
BACKGROUND STUDIES
This chapter provides the background literature needed to understand the context of this research. Data integration and semantic heterogeneity are discussed, and the details of ontology, its design methodology, and indexing are also included.
2.1. Data Integration
Data sources on the Internet are growing exponentially in size and number over time. These data sources contain information about different topics such as the stock market, product information, real estate, and entertainment. The data from these sources can be used to answer complex user queries that go beyond traditional searches. Advancements in information and communication technology have enabled users to access a wide array of related data sources and to integrate the results into useful information that might not be stored physically in a single place [1, 8, 12, 24].
Data integration enables the interoperability of data sources for knowledge discovery through a centralized access point, and provides a uniform query interface that gives the user the illusion of querying a homogeneous system [2, 15, 19, 31]. In data integration, a user is provided with a unified interface for posing queries, which is based on a schema typically referred to as the global schema or mediated schema. Depending on the approach used to develop the data integration system, the user is provided with results obtained from the underlying data sources either from a centrally materialized repository or in real time.
2.2. Issues in Data Integration
Data sources in data integration are maintained by different organizations, geographically distributed, and managed autonomously. This scenario creates a variety of barriers to integrating data from the participating sources. The most common issues include (a) autonomy and (b) semantic heterogeneity. To achieve scalable data integration, these issues need to be sorted out.
2.2.1. Autonomy: In data integration, autonomy indicates the ability of data sources to control their data and processing capabilities. The data sources retain their autonomy even after becoming part of the integration system [24, 31]. This autonomy gives rise to the following issues:
- The source data administrators might not be interested in, or may not have the resources for, helping the integrators understand how their site's schema relates to the schemas of the other sites being integrated.
- The source data administrators might change their site's schema without forewarning the integrators, which can lead the integration software to make invalid assumptions about the data source.
- The data source administrators might choose a schema that is very difficult to integrate with the other schemas in the integrated system.
2.2.2. Semantic Heterogeneity: In data integration, heterogeneities arise from different programming and data models as well as from different conceptualizations of a real-world object. Among these heterogeneities is semantic heterogeneity [20]. A variety of semantic heterogeneities can be found across data sources; a few of them are:
2.2.2.1. Synonym: The same concept may be represented with different names in different data sources, e.g., Course and Subject.
2.2.2.2. Homonym: Different concepts in different data sources may be represented by the same name, e.g., bear can be an animal or a verb meaning to tolerate.
2.2.2.3. Degree of likelihood: Two concepts can be relevant to each other to a certain degree of likelihood. This does not mean equality of concepts, as with synonyms, but rather relatedness. For example, in <:Teacher :isTeaching :Course> and <:TeachingAssistant :isAssisting :Course>, teaching assistant and teacher are not the same concept but are relevant to each other with a certain degree of likelihood.
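The three kinds of heterogeneity above can be made concrete with a small sketch. The mapping tables, concept names, and the 0.6 likelihood score below are illustrative assumptions for this example only, not part of the thesis methodology.

```python
# Synonyms: different names for the same concept map to one canonical term.
SYNONYMS = {"Subject": "Course", "Instructor": "Teacher", "Lecturer": "Teacher"}

# Homonyms: one name, several senses; context is needed to choose one.
HOMONYMS = {"bank": ["river_bank", "financial_institution"]}

# Degree of likelihood: related-but-not-equal concepts with a score in (0, 1).
RELATED = {("Teacher", "TeachingAssistant"): 0.6}

def canonical(term):
    """Map a source term to its canonical concept (synonym resolution)."""
    return SYNONYMS.get(term, term)

def relatedness(a, b):
    """1.0 for the same canonical concept, otherwise the degree of likelihood."""
    a, b = canonical(a), canonical(b)
    if a == b:
        return 1.0
    return RELATED.get((a, b), RELATED.get((b, a), 0.0))
```

Under these assumptions, `relatedness("Instructor", "Lecturer")` yields 1.0 (synonyms), while `relatedness("Teacher", "TeachingAssistant")` yields the partial score 0.6.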
2.3. Approaches to Data Integration
A variety of approaches to data integration exist. These approaches can be broadly classified into two major categories: (a) data warehousing and (b) mediation.
2.3.1. Warehouse: In data warehousing, the required data is extracted from the sources and stored in a centralized repository after integration [19, 24]. Users pose queries against the data model of the warehouse. This approach is also known as the eager or materialized-view approach to data integration. Query execution is efficient and response time is predictable in this approach, but results are often stale [1]. Figure 1 shows the data warehousing architecture [24].
2.3.2. Mediation: In the mediation approach, a user is given a unified schema, containing virtual relations, for posing queries. Data is not loaded into a central repository in advance; rather, queries are executed at run time [1, 19, 20, 24]. In order to answer a user query using the information sources, metadata is needed that describes the semantic relationships between the elements of the mediated schema and the schemas of the underlying data sources. This metadata is known as a source description. This approach is also known as the lazy or virtual-view approach to data integration. Query execution is slower in mediation, but results are up to date [1, 21, 24]. Figure 2 depicts the mediation-based architecture for data integration [24].
Figure 1: Data Warehousing Architecture for Data Integration
2.4. Query Processing in Data Integration
The main objective of data integration is to facilitate access to a set of autonomous, heterogeneous and distributed data sources. The ability to efficiently and correctly execute a query over the integrated data lies at the heart of data integration. The main steps in processing a query in data integration are (1) query reformulation and (2) query planning and execution.
2.4.1. Query Reformulation: Query reformulation is the first step in query processing, where a user query written in terms of the mediated schema is reformulated, using information about the sources, into queries that refer directly to the schemas of the underlying data sources [1, 8, 10, 11, 19, 24]. Query reformulation is further divided into two steps: (a) source identification and (b) query rewriting.
2.4.1.1. Source identification: Before executing a user query, relevant and effective sources should be clearly identified to optimize query execution. Relevance reasoning is the process of identifying relevant sources and pruning irrelevant and redundant ones. The main focus of our research is to propose an algorithm that speeds up relevance reasoning.
Figure 2: Mediator Wrapper Architecture for Data Integration
2.4.1.2. Query rewriting: Once the relevant sources have been identified, query rewriting is performed, and source-specific queries are formulated only for those sources that have been found relevant and can contribute some result to the user’s query.
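The two reformulation steps can be sketched with a simplified model in which each source description is just the set of mediated-schema relations the source can answer. The source names, relations, and set-based matching below are hypothetical simplifications, not the thesis's actual reformulation algorithm.

```python
# Simplified source descriptions: source name -> relations it can answer.
SOURCE_DESCRIPTIONS = {
    "src_university": {"Teacher", "Course", "Student"},
    "src_library":    {"Book", "Author"},
    "src_registrar":  {"Student", "Course"},
}

def identify_sources(query_relations, descriptions):
    """Step (a): keep only sources whose description overlaps the query."""
    return {name for name, rels in descriptions.items()
            if rels & query_relations}

def rewrite(query_relations, descriptions):
    """Step (b): build a per-source query over the relations it covers."""
    plans = {}
    for name in identify_sources(query_relations, descriptions):
        plans[name] = descriptions[name] & query_relations
    return plans

# A query over Teacher and Course: the library source is pruned,
# and source-specific queries are produced for the remaining two.
plans = rewrite({"Teacher", "Course"}, SOURCE_DESCRIPTIONS)
```

In this toy setting, `src_library` contributes nothing to the query and is pruned before execution, which is exactly the optimization source identification provides.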
2.4.2. Query Planning and Execution: Query reformulation provides some optimization by pruning irrelevant and overlapping sources to avoid redundant computation. The reformulated queries are evaluated using different strategies, producing multiple execution plans during optimization [11, 12]. The query execution engine runs these queries using the best and cheapest execution plan and deals with the limitations and capabilities of the data sources [28]. During execution, an important goal is to minimize the time to return the first answers to the query, rather than minimizing the total amount of work needed to execute the whole query [21, 24].
2.5. Ontology
Ontology is defined as an explicit and formal specification of a shared
conceptualization [3, 4, 15]. In this definition, the term conceptualization refers to an
abstract model of some domain knowledge that identifies relevant concepts of the
domain. The term shared indicates that ontology captures consensual knowledge that is
accepted by a group of people and systems. The term explicit means that concepts and the
constraints on these concepts are explicitly defined. Finally, the term formal means that
the ontology should be machine understandable [15]. Ontology was initially developed
by the Artificial Intelligence (AI) community to facilitate knowledge sharing and reuse.
Ontology carries the semantics of a particular domain and is hence used for representing domain knowledge. Ontology is widely used in data standardization and conceptualization. Ontologies have proven to be an essential element in many applications, including agent systems, knowledge management systems, and e-commerce systems. They can also be used to generate natural-language-like queries, integrate information intelligently, and provide semantics-based access to the Internet [36]. An ontology can be a taxonomy (e.g., Yahoo categories), a domain-specific standard terminology (e.g., UMLS and the Gene Ontology), or an online lexical database (e.g., WordNet).
Ontology consists of concepts, properties, and individuals. A concept is a thing of significance in the real world. Concepts may be organized into a super-class and subclass hierarchy, also known as a taxonomy, where subclasses specialize their super-classes. Concepts in an ontology can be synonymous or disjoint. Properties represent relationships between two concepts. Properties may have a domain and a specified range, and may be inverse, functional, transitive, or symmetric. Individuals represent objects in the domain. An ontology needs a reasoner, which can check whether all of the statements and definitions in the ontology are mutually consistent and can also recognize which concepts fit under which definitions. The reasoner helps to maintain the hierarchy correctly.
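A minimal sketch of these building blocks follows: concepts in a subclass taxonomy, a property with a domain and a range, individuals, and a reasoner-style subsumption check. The class and property names are illustrative assumptions, not drawn from the thesis's ontology.

```python
SUBCLASS_OF = {            # child concept -> parent concept (taxonomy)
    "Teacher": "Person",
    "TeachingAssistant": "Person",
    "Person": "Thing",
    "Course": "Thing",
}

PROPERTIES = {             # property -> (domain concept, range concept)
    "isTeaching": ("Teacher", "Course"),
}

INDIVIDUALS = {"alice": "Teacher", "cs101": "Course"}

def is_a(concept, ancestor):
    """Walk the taxonomy upward, as a reasoner's subsumption check would."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = SUBCLASS_OF.get(concept)
    return False

def statement_is_consistent(subj, prop, obj):
    """Check an assertion against the property's declared domain and range."""
    dom, rng = PROPERTIES[prop]
    return is_a(INDIVIDUALS[subj], dom) and is_a(INDIVIDUALS[obj], rng)
```

The `statement_is_consistent` check mirrors, in a very small way, what a reasoner does when it verifies that asserted statements respect the ontology's definitions.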
2.5.1. Ontology Modeling Languages: To develop ontology-driven applications, a language is needed to facilitate the semantic representation of the information these applications require. A number of research groups identified the need for a more powerful ontology modeling language, which led to joint initiatives for building such languages. As a result, a number of ontology modeling languages are available and in use today [36]. The most common include XML Schema [35], DAML+OIL [37], RDF and RDFS [25], and OWL [38]. Among these languages, we are most interested in RDF and RDFS for their role in data integration and the semantic web [4, 6, 25, 26].
2.5.2. RDF and RDFS: The Resource Description Framework (RDF) is a standard, developed by the World Wide Web Consortium (W3C), for representing information about resources. RDF provides interoperability across resources due to its simple structure. RDF Schema (RDFS) is a language for describing vocabularies of RDF data in terms of primitives such as Class, Property, domain, and range. The machine-understandable format of RDF facilitates the automated processing of web resources [5, 6, 26]. In RDF, a pair of resources (nodes) connected by a property (edge) forms a statement: (resource, property, value), often called an RDF triple. A set of triples is known as a model or graph. The components of a triple are a subject, a predicate (or property), and an object. Each triple represents a complete and unique fact for a specific domain. It can be modeled as a link in a directed graph, as shown in Figure 3: the subject is the start node of the link and the object is the end node, with the link always pointing towards the object. A detailed description of the RDF language can be found in [25].
Some of the important concepts of RDF are discussed below:
- A URI is a more generic form of a Uniform Resource Locator (URL). It allows us to identify a web resource without a specific network address (http://www.niit.edu.pk/delsa#Instructor).
- A blank node is used when either the subject or the object of a triple is unknown, or when the relationship between the subject and object is n-ary.
Figure 3: RDF Triple as Directed Graph
- A literal is a string used to represent names, dates, and numbers.
- A typed literal is a string combined with its data type (e.g., “Smith”^^http://www.w3.org/2001/XMLSchema#string).
- A container is a resource that is used to describe a group of things; participants in a container are members of the group. Blank nodes are usually used to represent containers.
- Reification allows triples to be attached to other triples as properties. One of its major issues is representational complexity, which is why it is sometimes termed “The Big Ugly”.
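The triple model described above can be sketched directly as (subject, property, object) tuples forming a directed graph, as in Figure 3. The namespace and triples below are illustrative assumptions only.

```python
NS = "http://example.org/delsa#"   # hypothetical namespace

# Each tuple is one RDF statement; together they form a graph.
triples = [
    (NS + "Instructor", NS + "isTeaching", NS + "Course"),
    (NS + "Course", NS + "hasTitle", '"Databases"'),   # object is a literal
]

def outgoing_edges(graph, subject):
    """Follow links from a start node toward its object nodes."""
    return [(p, o) for s, p, o in graph if s == subject]
```

Starting from the `Instructor` node, the only outgoing edge leads via `isTeaching` to the `Course` node, matching the directed-graph reading of a triple.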
A variety of RDF storage systems and browsers are available, such as Jena [33], Kowari [34], Sesame [35], Longwell [36], and the Oracle RDF Data Model [37, 40]. We have used the Oracle RDF Data Model for managing the global ontology and the source descriptions because it is efficient in terms of storage and does not suffer from slow performance. It provides a basic infrastructure for effectively managing RDF data in databases, and RDF data can be readily integrated, managed and analyzed together with other enterprise data. A comparative analysis of RDF storage [26] showed that the Oracle RDF Data Model outperforms the other existing RDF storage systems.
2.6. Indexing
Databases spend much of their time finding things, so lookups need to be as fast as possible. Indexes provide the basis for both rapid random lookups and efficient ordered access to data. An index is associated with a search key, that is, one or more attributes of a relation for which the index provides fast access. The disk space required to store an index is typically less than that required for the table itself. Indexes can be primary or secondary. A variety of indexing techniques are used in modern DBMSs, e.g., hash-based indexing, cluster indexing, tree-structured indexing, and bitmap indexing. The most efficient and compact indexing techniques for dealing with bulk data [26, 28] are (a) the B+-tree index and (b) the bitmap index. In this thesis we use bitmap indexes due to their compact internal representation for bulk data.
2.6.1. Bitmap Index: Bitmap indexing is a specialized technique geared towards easy querying based on multiple search keys. In a bitmap index, an attribute is stratified into a relatively small number of possible values and then queried based on that stratification. Internally, bitmap index entries are bitmap vectors of 0s and 1s. Figure 4 depicts the structure of a bitmap index. Bitmap indexing benefits applications where ad-hoc queries are executed on large amounts of data with a low level of concurrent transactions [26, 28]. The purpose of using a bitmap index in our approach is to provide pointers to RDF triples for efficient searching. Normal indexing could also achieve this functionality by storing an RDF triple with each index entry, but it consumes more space than bitmaps. In our bitmap index, a single bitmap vector represents the status of a whole source. Each bit in a bitmap vector corresponds to an RDF triple; if the bit is set, the source contains the corresponding RDF triple. A mapping function converts a bit position to an actual RDF triple. The bitmap index thus provides the same functionality as a regular index even though it uses a different internal representation. The major benefits of bitmap indexing include:
2.6.1.1. Compact Storage and Reduced Query Response Time: Fully indexing an RDF repository with traditional indexes can be prohibitively expensive in terms of space, because an index can be several times larger than the actual RDF data. Bitmap indexes are only a fraction of the size of the data being indexed. This compact and concise representation saves space and reduces computation while searching for an RDF triple.
2.2.2.2. Efficient Parallel Data Manipulation and Loading: In our
methodology, sources advertise their capabilities and contents in the form of RDF triples
to the global ontology. A single source may contain bulks of RDF triples. Bitmap indexes
are very efficient in the bulk processing of data manipulation statements and data loading.
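The mechanics described above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: the class name, method names, and data structures are all illustrative. It shows the mapping function between bit positions and RDF triples, one bit vector (segment) per source, and a membership lookup over the segments.

```python
# A minimal sketch (not the thesis implementation) of a bitmap index over
# RDF triples. Each source holds one bit vector; bit i is 1 iff the source
# contains the triple that the mapping function assigns to position i.

class TripleBitmapIndex:
    def __init__(self):
        self.positions = {}   # triple -> bit position (the mapping function)
        self.triples = []     # bit position -> triple (inverse mapping)
        self.segments = {}    # source name -> list of bits

    def register_triple(self, triple):
        """Reserve a bit position for a new global-ontology triple."""
        if triple not in self.positions:
            self.positions[triple] = len(self.triples)
            self.triples.append(triple)
            for bits in self.segments.values():
                bits.append(0)            # extend every existing segment

    def add_source(self, source):
        self.segments[source] = [0] * len(self.triples)

    def advertise(self, source, triple):
        """Set the bit: `source` contains `triple`."""
        self.segments[source][self.positions[triple]] = 1

    def sources_containing(self, triple):
        pos = self.positions[triple]
        return [s for s, bits in self.segments.items() if bits[pos]]

idx = TripleBitmapIndex()
t = ("nust:Ali", "rdf:type", "nust:Instructor")
idx.register_triple(t)
idx.add_source("S1")
idx.add_source("S2")
idx.advertise("S1", t)
# idx.sources_containing(t) -> ["S1"]
```

Note how a lookup touches only one bit per source, which is what makes the representation both compact and fast to traverse.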
In a nutshell, we have discussed different data integration approaches that are widely
used nowadays. Ontology and its modeling languages have been highlighted because
they can help data integration systems cope with the semantic heterogeneities that exist
in the domain of discourse. Finally, indexing has been discussed in general as a means to
speed up the querying mechanism; in particular, bitmap indexing has been explained as a
way to traverse Semantic Web metadata efficiently.
[Figure 4 here: the structure of a bitmap index, in which each search key (A, X, Y, G, T, U, V, Z) is associated with a bitmap vector of 0s and 1s.]
Figure 4: Structure of a bitmap index
CHAPTER 3
LITERATURE SURVEY
Relevant data source selection during query reformulation in data integration systems
has attracted significant attention in the literature over the last few decades [5, 6, 7, 8, 11,
12, 19, 20, 21, 24]. This chapter starts with a discussion and evaluation of the state-of-the-art
algorithms used in data integration systems for the identification of relevant data
sources during query reformulation.
3.1 Query Reformulation
In query reformulation, a user's query, originally written in terms of a mediated
schema, needs to be reformulated (rewritten) into queries that refer directly to the
schemas of the underlying data sources [10, 11, 19, 24]. In the literature, query
reformulation is further sub-divided into two steps: (a) relevant source selection and (b)
query rewriting.
3.1.1. Relevant source identification: Before executing user queries, relevant and
effective sources should be clearly identified, because not all available data sources may
contribute significantly. Relevance reasoning is the process of identifying relevant
sources and pruning irrelevant and redundant data sources.
3.1.2. Query rewriting: Once the relevant sources have been identified, query
rewriting is performed and source-specific queries are generated only for those sources
that have been found relevant and can contribute to the answer of the user's query.
3.2 State-of-the-Art Techniques
The main focus of this research is to propose an algorithm that can speed up the
process of relevance reasoning. The following subsections elaborate the state-of-the-art
algorithms used in different data integration systems for relevant source selection during
query reformulation.
3.2.1. The Bucket Algorithm: This algorithm has been used in the Information
Manifold (IM) [1, 20], a system for browsing and querying multiple networked
information sources. IM provides a mechanism to describe the contents and capabilities
of data sources in source descriptions (which in our architecture are called source
models). The Bucket algorithm uses the source descriptions to create query plans that can
access several information sources to answer a query. The algorithm prunes irrelevant
data sources using the source descriptions and reformulates source-specific queries only
for the relevant data sources. In order to describe and reason about the contents of data
sources, IM uses the relational model augmented with certain object-oriented features.
Technically, the algorithm constructs a number of buckets and checks a user query
against each bucket to identify the relevant data sources. Once the relevant buckets have
been identified, source-specific conjunctive queries are rewritten for each source.
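The bucket construction step can be illustrated with a deliberately simplified sketch. This is not the full Bucket algorithm: source descriptions are reduced to sets of predicate names, and the exponential containment test that validates each candidate combination is omitted; all names here are illustrative.

```python
from itertools import product

# An illustrative simplification of the Bucket algorithm: each query subgoal
# gets a bucket of sources whose descriptions mention that predicate; the
# candidate rewritings are the Cartesian product of the buckets. The real
# algorithm additionally runs a (worst-case exponential) containment test
# on every combination, which is omitted here.

def bucket_rewritings(query_subgoals, source_descriptions):
    buckets = []
    for predicate in query_subgoals:
        bucket = [src for src, preds in source_descriptions.items()
                  if predicate in preds]
        if not bucket:          # some subgoal cannot be answered at all
            return []
        buckets.append(bucket)
    return list(product(*buckets))

sources = {
    "S1": {"teaches", "enrolled"},
    "S2": {"teaches"},
    "S3": {"enrolled"},
}
plans = bucket_rewritings(["teaches", "enrolled"], sources)
# Each candidate plan pairs one source per subgoal, e.g. ("S1", "S3").
```

Even in this toy setting, the Cartesian product makes the number of candidate plans grow multiplicatively with the bucket sizes, which is exactly the inefficiency the later algorithms attack.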
3.2.2. The Inverse-Rules Algorithm: InfoMaster (http://infomaster.stanford.edu/)
[19] is an information integration system that provides integrated access to multiple,
distributed, and heterogeneous information sources on the Internet. InfoMaster creates a
virtual data warehouse. The algorithm behind InfoMaster is the Inverse-Rules algorithm,
which rewrites the definitions of data sources by constructing a set of rules. A set of rules
is formulated to define the contents and capabilities of each data source, and
heterogeneities among the data sources are dealt with during rule construction. These
rules guide the algorithm in computing records from the data sources using the source
definitions. The algorithm dynamically determines an efficient way to answer the user's
query using as few sources as necessary. In simple words, the algorithm does not
reformulate the query; rather, it reformulates the source definitions so that the original
query can be answered directly over the reformulated rules.
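The core inversion step can be sketched as follows. This is a toy illustration of the general idea, not InfoMaster's code: a view definition is represented as a head plus a list of body atoms over global relations, and inversion emits one rule per body atom, introducing a Skolem term for every variable that does not appear in the view head. The representation and naming are assumptions made for the example.

```python
# A toy sketch of rule inversion (not InfoMaster's actual implementation).
# A source view is head(vars) :- body atoms over global relations; inversion
# produces one rule per body atom, mapping source tuples back to the global
# relation, with Skolem terms for variables absent from the head.

def invert(view_name, head_vars, body):
    """body: list of (relation, vars). Returns inverse rules as
    (head_atom, body_atom) pairs, read as head_atom :- body_atom."""
    rules = []
    for relation, arg_vars in body:
        new_args = []
        for v in arg_vars:
            if v in head_vars:
                new_args.append(v)
            else:
                # existential variable: replace by a Skolem term over head vars
                new_args.append(f"f_{v}({','.join(head_vars)})")
        rules.append(((relation, tuple(new_args)), (view_name, tuple(head_vars))))
    return rules

# View S1(x) :- teaches(x, y) inverts to teaches(x, f_y(x)) :- S1(x).
rules = invert("S1", ["x"], [("teaches", ["x", "y"])])
```

The inverted rules are computed once per source and reused for every query, which is the query-independence property discussed in the critical analysis below.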
3.2.3. The MiniCon Algorithm: The MiniCon algorithm [19, 21] improves on the
Bucket algorithm; its main design focus is the performance of query reformulation
algorithms. MiniCon finds the maximally contained rewriting of a conjunctive query
using a set of conjunctive views. The Bucket algorithm completes in two steps:
computing the buckets, and then reformulating the source-specific queries using the
buckets of the relevant data sources. The main complexities of the Bucket algorithm are:
(a) even when the number of sound data sources is small, it may generate a large number
of candidate solutions and then reject them; (b) the exponential conjunctive-query
containment test used to validate each candidate solution. MiniCon instead pays attention
to the interaction of the variables in the user query and in the source definitions, pruning
the sources that would later be rejected by the containment test. This timely detection of
irrelevant data sources improves MiniCon's performance, because far fewer combinations
need to be checked.
3.2.4. The Shared-Variable-Bucket Algorithm: The design goal of this algorithm
[38] is to remedy the deficiencies of the Bucket algorithm and develop an efficient
algorithm for query reformulation. The key idea underlying the algorithm is to examine
the shared variables and reduce the bucket contents, thereby reducing the number of view
combinations. This reduction ultimately optimizes the second phase of the algorithm.
3.2.5. The CoreCover Algorithm: In this algorithm [39], views are materialized from
source relations. Its main aim is to find rewritings that are guaranteed to produce an
optimal physical plan. The emphasis leans mostly towards query optimization; therefore,
different cost models are also considered. The algorithm seeks an equivalent rewriting
rather than a contained rewriting.
3.3. Critical Analysis
The CoreCover algorithm [39] differs from the other query reformulation algorithms
in the following respects. Firstly, it finds an equivalent rewriting, whereas all the other
algorithms find a maximally-contained source-specific rewriting of the query. Secondly,
it adopts the closed-world assumption to find an equivalent rewriting, whereas all the
other algorithms adopt the open-world assumption. Thirdly, its reformulation stage of
query processing has to guarantee an optimal plan for the query. The Bucket, MiniCon,
and Shared-Variable-Bucket algorithms construct buckets and then take the Cartesian
product of the buckets to produce source-specific rewritings. In the Bucket algorithm, the
constructed buckets are large, which causes many combinations to be computed and
tested in the second phase; the MiniCon and Shared-Variable-Bucket algorithms avoid
this deficiency. The MiniCon algorithm has been shown to outperform both the Bucket
and the Inverse-Rules algorithms [21]. The Inverse-Rules algorithm is query independent:
the rules are computed once and applied to all queries, and they are easily extendable to
handle functional dependencies [19]. However, the algorithm ignores the predicates
during rewriting and requires an additional phase, added to the algorithm, to remove the
irrelevant views [21]. None of these algorithms pays attention to the fast and efficient
traversal of source descriptions. As the number of sources grows, their metadata also
grows. How can the search space of metadata be reduced during relevance reasoning to
make the whole process more efficient? Answering this question ultimately leads to
scalable data integration systems in which sources can join and leave the system
arbitrarily while the query execution engine synchronizes itself with any change and
submits sub-queries only to the relevant and available data sources. Another deficiency
of these algorithms is that most of them use relational models for source descriptions,
whereas ontology-based models can represent fine-grained distinctions between the
contents and capabilities of the different data sources. These fine-grained distinctions can
help us reason about the data sources in a more precise and efficient manner.
In a nutshell, we have discussed the state-of-the-art algorithms used for query
reformulation in data integration systems. These algorithms have been analyzed and
compared with each other, and their features and deficiencies illustrated.
CHAPTER 4
PROPOSED ARCHITECTURE
In order to execute a user's query in the scalable data integration system proposed in
[8], the query execution process needs to be optimized. We have proposed an ontology-driven
relevance reasoning architecture to improve the response time of user queries
during relevance reasoning. This chapter is organized into three major sections. The first
section discusses the components of the proposed relevance reasoning architecture. The
second section explains the semantic matching process and the proposed scoring strategy.
Finally, the proposed methodology for relevance reasoning is discussed in detail and
elaborated through an example.
4.1 Proposed Architecture for Relevance Reasoning
This section presents the proposed architecture designed for relevance reasoning
during source selection in a data integration system. The proposed architecture, shown in
Figure 5, comprises the following components.
4.1.1. Global Ontology: The global ontology is the knowledge base of the proposed
architecture. It helps in generating user queries and enables semantic inference. The
major components of the global ontology are: (1) domain knowledge, which represents
the domain of discourse in the form of RDF triples; each RDF triple is uniquely identified
by a globally unique identifier (GUID), and the GUIDs are used in the semantic indexing
scheme for relevance reasoning; (2) the concept and relationship hierarchies, which
represent the semantic relationships among concepts and among relationships
respectively and help in resolving the semantic heterogeneities that exist in a domain;
(3) the rule-base: a rule is an object that can be applied to deduce inferences from RDF
triples; every rule is identified by its name and consists of two parts, (a) an antecedent,
known as the body of the rule, and (b) a consequent, known as the head of the rule; the
rule-base is an object that consists of rules; (4) the rules-index, which computes and
maintains deduced inferences by applying a specific set of rule-bases in order to optimize
reasoning.
4.1.2. Ontology Management Service: The ontology management service facilitates
the creation and maintenance of the global ontology. It provides a set of application
program interfaces (APIs) that perform the following functions: (1) publishing the
domain knowledge in the form of RDF triples, assigning GUIDs to the RDF metadata
triples, and mapping the GUIDs over the bitmap index; (2) defining semantic operators
and constructing the concept and relationship hierarchies; (3) creating and dropping a
rule-base and modifying the set of rules in a rule-base; (4) creating and maintaining the
rules-index and synchronizing it after rules are modified in the rule-base.
4.1.3. Source Descriptions Storage (SDS): A source description is the metadata of a
data source. This metadata can be further classified into source metadata and content
metadata. In order to make the source descriptions of data sources interoperable in a
heterogeneous environment, they are described in a conceptual model in the form of a
local ontology [8]. The metadata of a data source is expressed as RDF triples in the local
ontology. These RDF triples are assigned local unique identifiers (LUIDs) using a
sequence-generating object of each data source. In a nutshell, the source descriptions
storage is a set of local ontologies.
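The LUID assignment just described can be sketched as follows. This is a minimal illustration under stated assumptions: each source gets its own sequence object starting at 1, and a local ontology is modeled as a dictionary from LUID to triple. Class and method names are illustrative, not the thesis implementation.

```python
from itertools import count

# A minimal sketch of source descriptions storage: each registered source
# gets its own local ontology (here a dict) and its own LUID sequence.

class SourceDescriptionsStorage:
    def __init__(self):
        self.local_ontologies = {}   # source -> {LUID: triple}
        self.sequences = {}          # source -> per-source sequence generator

    def register_source(self, source):
        self.local_ontologies[source] = {}
        self.sequences[source] = count(1)   # LUIDs start at 1 for each source

    def advertise(self, source, triple):
        """Assign the next LUID of this source to an advertised triple."""
        luid = next(self.sequences[source])
        self.local_ontologies[source][luid] = triple
        return luid

sds = SourceDescriptionsStorage()
sds.register_source("S1")
luid = sds.advertise("S1", ("niit:Ahmed", "rdf:type", "niit:Teacher"))
# luid == 1; each source numbers its own triples independently
```

The per-source sequence is the point of the design: LUIDs are only unique within one local ontology, unlike the GUIDs of the global ontology.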
4.1.4. Source Registration Service: The source registration service facilitates the
creation and maintenance of a local ontology for a data source in the source descriptions
storage. It provides a set of application program interfaces (APIs) that perform the
following functions: (1) creates a unique sequence number generating object for the
incoming data source; (2) creates a local ontology to hold the RDF triples advertised by
the data source; (3) registers the local ontology in the source descriptions storage;
(4) inserts the RDF triples of the data source into its corresponding local ontology.
Figure 5: Proposed Architecture for Relevance Reasoning in Data Integration Systems
4.1.5. Bitmap Index Storage: A bitmap index is a cross-tab structure of bits [26, 28].
We employ a bitmap index for efficient traversal during relevance reasoning. The bitmap
index is divided into bitmap segments, and internally the data in a bitmap segment is
represented as bits. Each data source retains one bitmap segment in the bitmap index. In
the proposed architecture, the data sources are represented on the vertical side of the
index, whereas the RDF triples of the global ontology are represented on the horizontal
side. A bit is unset (0) if the data source does not contain the corresponding RDF triple,
and set (1) if it does. A sequence number generating object is used to assign a unique
identifier to each bitmap segment.
4.1.6. Index Management Service: The index management service facilitates the
creation and maintenance of a bitmap segment for a data source in the bitmap index
storage. It provides a set of application program interfaces (APIs) that perform the
following functions: (1) bitmap segment creation creates the bitmap segment for an
incoming data source and initializes all its bits to 0 (unset); (2) bitmap synchronization
keeps the bitmap segment of a data source consistent with its local ontology; (3) shuffle
bit shuffles the bits of a bitmap segment during synchronization.
4.1.7. Index Lookup Service: The index lookup service facilitates efficient traversal
of the bitmap index. It provides a set of application program interfaces (APIs) that
perform the following functions: (1) relevant source identification traverses the bitmap
index against an RDF triple and identifies the bitmap segments where the bit is set;
(2) irrelevant source pruning traverses the bitmap index against an RDF triple and
identifies the irrelevant bitmap segments where the bit is unset.
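The two lookup operations can be sketched directly over a toy index. This is an illustrative simplification: the index is a dictionary from source name to bit list, and `position` stands for the bit position that a triple's GUID maps to; function names are assumptions, not the thesis API.

```python
# A sketch of the two index-lookup operations over a toy bitmap index
# stored as {source: list-of-bits}. `position` is the bit position assigned
# to a triple's GUID; names are illustrative only.

def relevant_sources(index, position):
    """Sources whose bit is set at `position` (they contain the triple)."""
    return [src for src, bits in index.items() if bits[position] == 1]

def irrelevant_sources(index, position):
    """Sources whose bit is unset at `position` (safe to prune)."""
    return [src for src, bits in index.items() if bits[position] == 0]

index = {
    "S1": [1, 0, 1],
    "S2": [0, 0, 1],
    "S3": [1, 1, 0],
}
# relevant_sources(index, 0)   -> ["S1", "S3"]
# irrelevant_sources(index, 0) -> ["S2"]
```

Identification and pruning are complementary scans of the same column of bits, which is why both can be served by a single index traversal.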
4.1.8. Ontology Reasoning Service: The ontology reasoning service provides
reasoning and inference capabilities to the proposed architecture. It offers a set of
application program interfaces (APIs) that perform the following functions: (1) semantic
matching, the process of finding semantic similarity among different terms (concepts and
relationships) in order to resolve semantic heterogeneities; (2) inference and reasoning,
which supports the semantic matching process by incorporating the rules, rule-base, and
rules-index; (3) semantic query generation, which generates queries against the global
ontology using the semantic operators during semantic matching. Note that these queries
are different from the user query and should not be confused with it.
4.1.9. Relevance Reasoning Service: The relevance reasoning service identifies the
relevant and effective data sources for a query, using the index lookup service over the
bitmap index. It provides a set of application program interfaces (APIs) that perform the
following functions: (1) semantic query expansion expands a user query to its
semantically relevant RDF triples; (2) relevance reasoning identifies the relevant and
effective data sources for a given user query; (3) relevance ranking ranks the data sources
for a given user query based on the semantic similarity scores obtained.
4.2 Semantic Matching & Source Ranking for RDF Triples
4.2.1. Relevance Levels and Proposed Scoring Strategy: During semantic matching,
the terms of the user's query triples are matched with the terms of source triples. As a
result, one of five relevance levels is obtained for each term. These relevance levels are
given numeric scores for the purpose of quantification, which helps us rank a source for a
given query. The relevance levels and the operators used in the semantic matching
process are defined and explained below.
4.2.1.1. Exact Matching: A term is an exact match of another term if and only if the
two are lexically equal. For example, the term nust:Instructor is an exact match of
niit:Instructor. A numeric score of 1.0 is assigned to exact-matching terms wherever they
appear in an RDF triple.
4.2.1.2. Synonym Matching: It is unrealistic to assume that the same name will
always be used for a concept across a domain, so an explicit specification of synonyms
using some operator is required. Synonyms are terms that are lexically different but have
the same meaning. For example, the term nust:Instructor is a synonym of the term
niit:Teacher. A numeric score of 0.8 is assigned to synonym-matching terms wherever
they appear in an RDF triple. We use the owl:sameAs operator for specifying these
mappings in the rule-base of the global ontology.
4.2.1.3. Subclass Matching: In some scenarios, taxonomies are used for knowledge
representation, where generic concepts subsume specific concepts. To cope with the
subsumption relationship, an operator for its explicit specification is required. A term is a
subclass of another term if and only if it is subsumed by that term. For example,
nust:Employee might subsume niit:Instructor. A numeric score of 0.6 is assigned to
subclass-matching terms wherever they appear in an RDF triple. We use the
rdfs:subClassOf operator for specifying these mappings in the rule-base of the global
ontology.
4.2.1.4. Degree of Likelihood: In some situations, data sources might contain
concepts that are not totally disjoint but are instead related to some other term with a
degree of likelihood. For example, the term nust:Instructor might be relevant to
nust:TeacherAssistant with some degree of likelihood. This type of mapping cannot be
specified using the previously defined operators. A numeric score of 0.5 is assigned to
likelihood-related terms wherever they appear in an RDF triple. We use the
owl:equivalentOf operator for specifying these mappings in the rule-base of the global
ontology.
4.2.1.5. Disjoint: A term is disjoint from another term if and only if they are different
from each other. For example, the term nust:Instructor is disjoint from nust:Student. A
numeric score of 0.0 is assigned to disjoint terms wherever they appear in any component
of an RDF triple. These relevance levels and their scoring strategies are summarized in
Table 1 below:
Table 1: Relevance levels and scoring strategy

    Level   Relevance relation   Score
    1       exact match          1.0
    2       sameAs               0.8
    3       subClassOf           0.6
    4       equivalentOf         0.5
    5       disjointFrom         0.0
4.2.2. Term Similarity: We use the same semantic matching strategy for both
concepts and relationships; terms include both, and we maintain a concept hierarchy and
a relationship hierarchy. We extract the relationship between the query and source terms
using the respective hierarchy and then assign the standard relevance score defined in
Table 1. An RDF triple contains a subject, a predicate, and an object. The subject and
object are treated as concepts, so their similarity is computed using the concept hierarchy,
whereas the predicate's similarity is calculated using the relationship hierarchy.
4.2.3. RDF Triple Similarity: To calculate the relevance between user query triples
and source RDF triples, we combine both aspects of term similarity (i.e., concepts and
relationships). The overall RDF triple similarity is calculated as shown in Equation 1,
where qT denotes the query triple and s denotes a source, qt and st are the query and
source terms to be matched, and Sim(qT, s) is the overall similarity of a single query
triple for a given source. Here i and j range over the source RDF triples and the query
triple terms respectively.
4.2.4. Source Ranking: The user query triples and source RDF triples are matched to
find the similarity of each query triple with the data source's triples. Once the RDF triple
similarities have been computed, the source score for the whole query is computed using
the formula in Equation 2, and the data sources are ranked based on the scores obtained.
In Equation 2, sim_src is the total score of a source s for a user query, obtained by
multiplying the similarity scores of all the query triples; q_i denotes the i-th query triple
and n the total number of query triples:

    sim_src(s) = Sim(q_1, s) × Sim(q_2, s) × ... × Sim(q_n, s)    (2)
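The scoring scheme can be made concrete with a small worked example. Two assumptions are stated up front, since the full form of Equations 1 and 2 does not survive in this copy: a query triple's similarity to a source is taken as the product of its three term scores (subject, predicate, object), and the source score multiplies the similarities of all query triples, as the text describes for Equation 2. All names are illustrative.

```python
# A worked sketch of the Table 1 scoring scheme, under two stated
# assumptions: a triple match is scored as the product of its three term
# scores, and the source score is the product over all query triples.

SCORES = {"exact": 1.0, "sameAs": 0.8, "subClassOf": 0.6,
          "equivalentOf": 0.5, "disjointFrom": 0.0}

def triple_similarity(term_relations):
    """term_relations: relevance level of (subject, predicate, object)."""
    s, p, o = term_relations
    return SCORES[s] * SCORES[p] * SCORES[o]

def source_score(per_triple_relations):
    """One entry per query triple; multiply their similarities (Eq. 2)."""
    score = 1.0
    for relations in per_triple_relations:
        score *= triple_similarity(relations)
    return score

# One query triple matching a source exactly on subject and object but via
# a synonym (sameAs) on the predicate: 1.0 * 0.8 * 1.0 = 0.8
score = source_score([("exact", "sameAs", "exact")])
```

Note that under this multiplicative scheme a single disjointFrom match (score 0.0) zeroes out the whole source score, which matches the intuition that a disjoint term makes the source irrelevant for that query.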
4.3 Proposed Semantic Matching Methodology
This section discusses our proposed methodology for relevance reasoning, which
identifies the most relevant and effective data sources using a bitmap index. The
methodology can be divided into three main workflows, which help in understanding the
intricacies of the proposed architecture. Each workflow is discussed in detail below.
4.3.1. Ontology Management Workflow: The ontology management workflow
manages the global ontology in the architecture; the ontology management service plays
a prominent part in it. The four major activities carried out by the ontology management
workflow are:
Domain knowledge representation
Concept & relationship hierarchy representation
Rules & rules-base management
Rules-index management
Figure 6 shows, as a sequence diagram, all the activities performed during the
ontology management workflow.
Figure 6: Sequence Diagram for Ontology Management Workflow
Domain knowledge representation is the registration of RDF triples in the global
ontology. These RDF triples are stored in the global ontology, and GUIDs are assigned to
them using a unique sequence number generator object. The GUIDs are then allocated
positions over the bitmap index, and the transactions are permanently recorded in the
global ontology. The snippet in Figure 7 shows pseudo-code for the insertion of an RDF
triple into the global ontology; its implementation issues and details are discussed in a
later chapter.
Pseudo-Code for Domain Knowledge Registration

    For each RDF triple of global ontology
        Assign GUID to RDF triple
        Add RDF triple to the global ontology
        Extend bitmap index
        Increase the length of bitmap pattern by one
        Assign location to the RDF triple reserved over the bitmap index
    Perform commit to apply changes persistently to global ontology
Concept & relationship hierarchy representation involves the definition of the
semantic operators and the use of these operators to build the respective hierarchies.
These operators include sameAs, equivalentOf, subClassOf, and disjointFrom, as
explained in the previous section. RDF triples are added to the global ontology to
represent the concept and relationship hierarchies; the bitmap index is not maintained for
these RDF triples.
Rules & rules-base management involves the creation of the rules-base and the
insertion of rules into it. In order to reduce the number of mappings among the
hierarchies and increase the inference capabilities of the rule-base, two rules are inserted
for each semantic operator: InverseOf<operator> and TransitiveOf<operator>. The
InverseOf<operator> rule tells the rule-base that if a term A is related to another term B
by a relation R, then B is related to A by R⁻¹. Figure 8 shows the N3 representation of the
InverseOf rule for the sameAs operator in the Semantic Web Rule Language.
The TransitiveOf<operator> rule tells the rule-base that if a term A is related to a
term B by some relation R, and B is in turn related to a term C by the same relation R,
then A is related to C by R. Figure 9 shows the N3 representation of the TransitiveOf rule
for the sameAs operator in the Semantic Web Rule Language.
Figure 7: Pseudo-code for RDF triple registration in the global ontology
: Def-InverseOfSameAs@swrl(“(?x sameAs ?y) -> (?y sameAs ?x)”)
Figure 8 InverseOf SameAs rule inserted in the rule-base
: Def-TransitiveOfSameAs@swrl (“(?x sameAs ?y) (?y sameAs ?z) -> (?x sameAs ?z)”)
Figure 9 TransitiveOf SameAs rule inserted in the rule-base
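What the two rules in Figures 8 and 9 jointly compute is the symmetric, transitive closure of the asserted sameAs pairs. The sketch below is plain Python standing in for the SWRL rules, with a naive fixed-point loop; the term names are illustrative examples, not assertions from the thesis.

```python
# A sketch of what the InverseOf and TransitiveOf rules for sameAs compute:
# the symmetric, transitive closure of the asserted sameAs pairs, found by
# iterating both rules to a fixed point.

def sameas_closure(pairs):
    closed = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closed):
            if (b, a) not in closed:                  # InverseOfSameAs
                closed.add((b, a))
                changed = True
        for (a, b) in list(closed):
            for (c, d) in list(closed):
                if b == c and (a, d) not in closed:   # TransitiveOfSameAs
                    closed.add((a, d))
                    changed = True
    return closed

facts = {("nust:Instructor", "niit:Teacher"), ("niit:Teacher", "uet:Lecturer")}
inferred = sameas_closure(facts)
# Both ("niit:Teacher", "nust:Instructor") and
# ("nust:Instructor", "uet:Lecturer") are now derivable.
```

This is why only two rules per operator suffice to keep the explicit mapping set small: the rules-index can pre-compute the closure rather than requiring every pairwise mapping to be asserted by hand.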
Rules-index management involves the creation and management of the rules-index
for a rules-base. Once rules are inserted into the rules-base, the corresponding rules-index
is refreshed to pre-compute the inferred RDF triples.
4.3.2. Source Registration Workflow: The source registration workflow registers
data sources in the data integration system. The three major activities carried out by the
source registration workflow are:
Local ontology creation
Bitmap segment creation
Bitmap synchronization
Local ontology creation involves the creation of a local ontology for the incoming
data source and of a unique sequence number generator object, along with the insertion of
RDF triples into the created ontology. The source registration service plays a prominent
part here: an ontology is created for the incoming data source and registered with the
source descriptions storage; the RDF triples advertised by the data source are assigned
local unique identifiers (LUIDs) and added to the local ontology; and the transactions are
permanently recorded in the source descriptions storage. The snippet in Figure 10 shows
pseudo-code for local ontology creation and RDF triple insertion; its implementation
issues and details are discussed in a later chapter.
Pseudo-Code for Local Ontology Creation

    Create ontology for incoming source in Source Descriptions Storage
    Create unique sequence generator for incoming source RDF triples
    Assign LUIDs to the RDF triples
    Add RDF triples to the local ontology in Source Descriptions Storage
    Perform commit to apply changes persistently to Source Descriptions Storage
Figure 10 Pseudo-code for RDF triple creation of local ontology
Bitmap segment creation involves cloning the bitmap pattern and creating a bitmap
segment for the incoming data source over the bitmap index. The index management
service plays a prominent role here. The bitmap pattern stored in the global ontology is
cloned for the newly created bitmap segment, and initially all its bits are unset, i.e., 0. A
unique identifier is assigned to the bitmap segment, which is then added to the bitmap
index. The snippet in Figure 11 shows pseudo-code for bitmap segment creation; its
implementation issues and details are discussed in a later chapter.
Bitmap synchronization involves plotting the RDF triples of a data source
consistently and correctly by shuffling the bits in its bitmap segment. The index
management service plays a prominent role here by spawning a listener process that
listens for any invalidation in the source descriptions storage (i.e., changes in a local
ontology that have not yet been propagated and plotted over the bitmap index). If any
invalidation is found, index synchronization starts. During synchronization, the RDF
triples of the data source are fetched; every RDF triple is decomposed into its terms
(subject, predicate, and object) and given to the ontology reasoning service. The ontology
reasoning service performs reasoning and inference, which helps the index management
service extract the GUIDs for the corresponding RDF triples. The positions of the GUIDs
are identified over the bitmap index
Pseudo-Code for Bitmap Segment Creation

    Check whether bitmap segment exists for the incoming source
    If (no)
        Clone bitmap pattern from global ontology RDF triples
        Initialize bits to zero (0)
        Assign a unique number to the bitmap segment
        Add bitmap segment to the bitmap index for incoming source
        Perform commit to apply changes persistently in index
Figure 11 Pseudo-Code for Bitmap Segment Creation
Pseudo-Code for Bitmap Synchronization

    For each incoming RDF triple advertised by a data source
        Decompose RDF triple into its components
        Perform reasoning for semantic similarity
        Extract GUID for the corresponding RDF triple
        Identify its position over the bitmap index
        Fetch the bitmap segment for the data source
        Shuffle the bit to 1 at the corresponding position in the bitmap segment
    Perform commit to apply changes persistently in index
and the bits are shuffled accordingly. The snippet in Figure 12 shows pseudo-code for
bitmap synchronization; its implementation issues and details are discussed in a later
chapter.
Figure 13 shows, as a sequence diagram, all the activities performed during the
source registration workflow.
Figure 12: Pseudo-Code for Bitmap Synchronization
Figure 13: Sequence Diagram for Source Registration Workflow
4.3.3. Relevance Reasoning Workflow: The relevance reasoning workflow comprises
the steps carried out to identify the relevant and effective data sources for a user's query.
The relevance reasoning service plays a prominent part in this workflow, cooperating
with the index lookup service and the ontology reasoning service to perform the
following activities:
Semantic query expansion
Source selection
Source Ranking
The Figure 14 shows all the activities that are performed during the source
registration workflow using sequence diagram.
Semantic query expansion: A user submits the query in RDF which is passed to the
relevance reasoning service. The RDF triples that are entered by the user into a query are
called asserted query triples. A user can submit queries in global ontology terms as well
as local ontology terms of their underlying data sources. Relevance reasoning service
expands the user query to all possible combinations using ontology reasoning service.
Every term of the query triple is expanded using semantic operators for synonyms, lexical
variants, subsumption, and degree of likelihood. This expansion adds extra
triples to the user query; these RDF triples are called inferred query triples.
The snippet in Figure 15 shows pseudo-code for the semantic query expansion. Its
implementation details and issues are discussed in the following chapter.

Figure 14: Sequence Diagram for Relevance Reasoning Workflow

Pseudo-code for Query Expansion in Relevance Reasoning

InferredTriplesList = Ø
For each RDF triple in AssertedTripleList of user's query
    Isolate subject, property, and object of the current RDF triple
    Calculate semantic similarity and add relevant terms for the subject of the RDF triple
    Calculate semantic similarity and add relevant terms for the property of the RDF triple
    Calculate semantic similarity and add relevant terms for the object of the RDF triple
    Take the Cartesian product of the terms
    Populate InferredTriplesList with the Cartesian product
Return InferredTriplesList
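The Cartesian-product expansion described above can be sketched in Python. This is an illustrative model only; the thesis implementation is in PL/SQL, and the stub dictionary of relevant terms stands in for the ontology reasoning service.

```python
from itertools import product

# Sketch of semantic query expansion. The relevant-term lookups are
# stubbed with a dictionary; in the thesis they come from the ontology
# reasoning service.

def expand_triple(triple, relevant_terms):
    s, p, o = triple
    subjects = [s] + relevant_terms.get(s, [])
    properties = [p] + relevant_terms.get(p, [])
    objects = [o] + relevant_terms.get(o, [])
    # The Cartesian product of the three buckets yields the inferred triples.
    return list(product(subjects, properties, objects))

relevant = {
    "Instructor": ["Professor", "Lecturer"],
    "isTeaching": ["Teaches"],
    "Course": ["Subject"],
}
inferred = expand_triple(("Instructor", "isTeaching", "Course"), relevant)
# 3 subjects x 2 properties x 2 objects = 12 inferred combinations,
# the first of which is the asserted triple itself.
```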
Source Selection: Once the query has been expanded with semantically relevant RDF triples,
their GUIDs are reconciled from the global ontology. The GUIDs give the positions
of the RDF triples over the bitmap index. These positions are passed to the index lookup
service, which traverses the bitmap segments of each source at the corresponding
positions and identifies the data sources for which the bits are set. The snippet in Figure
16 shows pseudo-code for the source selection; its implementation details and issues
are discussed in the following chapter.
Figure 15 Pseudo-Code for Query Expansion in Relevance Reasoning Workflow
Pseudo-code for Source Selection in Relevance Reasoning

RelevantSourceList = Ø
For each RDF triple in user's query [Asserted + Inferred]
    Reconcile the GUID for the incoming RDF triple from the global ontology
    Identify the bitmap location of the RDF triple using the GUID
    Pass the bitmap location to the index lookup service
    Traverse the bitmap segments at the corresponding location to identify relevant sources
    Add the sources to RelevantSourceList
Return RelevantSourceList
Source Ranking: The identified data sources are ranked according to their relevance
to the user query. Table 1 shows our scoring scheme. First, term similarity is computed
for each component of a query RDF triple against a given source. The term similarities
are then used in equation 1 to compute the RDF triple similarity. Finally, source
similarity is computed by equation 2, and the sources are ranked according to the score
obtained for the given user query.
4.4 Explanation of the Proposed Methodology using a Case Study
We use a portion of the well-known university ontology as an example. In this
scenario, we have a global ontology named NUST_DB, as shown in Figure 17, and
three data sources named EME_DB, NIMS_DB, and NIIT_DB. The RDF triples of the
global ontology are shown in Table 2.
NUST_RDF_DATA

GUID          RDF Triples
nust-1000001  < nust:Instructor, nust:isTeaching, nust:Course >
nust-1000002  < nust:Instructor, nust:isAdvisorOf, nust:Student >
nust-1000003  < nust:Student, nust:isRegisteredIn, nust:Course >
nust-1000004  < nust:Student, nust:hasMajor, nust:Department >
nust-1000005  < nust:Instructor, nust:worksIn, nust:Department >
nust-1000006  < nust:TeachingAssistant, nust:isAssisting, nust:Course >
Figure 16 Pseudo-Code for Source Selection in Relevance Reasoning Workflow
Table 2 RDF triples of the Global Ontology
Figure 17 Snap shot of the Global Ontology
[Figure 17 depicts the global ontology graph: Instructor isTeaching Course and worksIn Department; Instructor isAdvisorOf Student; Student isRegisteredIn Course and hasMajor Department; TeachingAssistant isAssisting Course.]
The RDF triples of the global ontology form the basis for the bitmap indexing in our
proposed architecture. The pattern of the index is illustrated in Table 3.
Source-segment position-1 position-2 position-3 position-4 position-5 position-6
xxxxxxxxxxxxxx nust-1000001 nust-1000002 nust-1000003 nust-1000004 nust-1000005 nust-1000006
To manage concept and relationship hierarchies, the semantic matching
operators sameAs, equivalentOf, subClassOf, and disjointFrom are defined. A concept
such as nust:Instructor is mapped to the concept niit:Lecturer using the subClassOf
operator to specify a subsumption relationship. The term nust:Course is mapped to the
term nust:Subject using the sameAs operator to specify synonyms and lexical
variants. Similarly, nust:Instructor is mapped to nust:TeachingAssistant using the
equivalentOf operator to specify a degree of likelihood, and so on. Relationship
hierarchies are managed accordingly. These hierarchies are illustrated
in Figure 18.
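The mappings above can be modelled as a set of (term, operator, term) entries queried per operator, which is how the buckets of Table 6 are later populated. An illustrative Python sketch follows; the helper names are ours, and the pairs are taken from the case study.

```python
# Concept/relationship mappings under the semantic operators, with
# pairs taken from the case study (Figure 18). Names are illustrative.

MAPPINGS = [
    ("Course", "sameAs", "Subject"),
    ("isTeaching", "sameAs", "Teaching"),
    ("isTeaching", "sameAs", "Teaches"),
    ("Professor", "subClassOf", "Instructor"),
    ("Prof", "subClassOf", "Instructor"),
    ("Lecturer", "subClassOf", "Instructor"),
    ("Teacher", "subClassOf", "Instructor"),
    ("TeachingAssistant", "equivalentOf", "Instructor"),
]

def related_terms(term, operator):
    """Terms linked to `term` by `operator`, read in either direction."""
    out = set()
    for a, op, b in MAPPINGS:
        if op != operator:
            continue
        if a == term:
            out.add(b)
        elif b == term:
            out.add(a)
    return out

# Subject bucket for "Instructor" under subClassOf, as in Table 6:
subclasses = related_terms("Instructor", "subClassOf")
```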
Three local ontologies are created for the data sources following the naming
convention <DataSource>_RDF_Data. There are semantic heterogeneities between
Table 3 Structure of Bitmap Index
Bitmap Pattern
Figure 18 Concept & Relationship Hierarchies Managed using Semantic Operators over Global Ontology
[Figure 18 depicts the hierarchies: Professor, Prof, Lecturer, and Teacher are linked to Instructor via subClassOf; Course is linked to Subject, and isTeaching to Teaching and Teaches, via sameAs; TeachingAssistant and isAssisting are linked via equivalentOf; exactMatch relates identical terms.]
the contents of the data sources. Table 4 describes the RDF triples of the sources stored in
their respective ontologies.
EME_RDF_DATA
Local Link-ID  RDF Triples
eme-1011       < eme:Professor, eme:Teaches, eme:Subject >
eme-1012       < eme:Professor, eme:Advises, eme:Student >
eme-1013       < eme:Student, eme:RegisteredIn, eme:Subject >

NIMS_RDF_DATA
Local Link-ID  RDF Triples
nims-2011      < nims:Teacher, nims:isAdvisorOf, nims:Student >
nims-2012      < nims:Teacher, nims:WorksIn, nims:Department >
nims-2013      < nims:Student, nims:hasMajor, nims:Department >

NIIT_RDF_DATA
Local Link-ID  RDF Triples
niit-3011      < niit:Lecturer, niit:isTeaching, niit:Course >
niit-3012      < niit:TeachingAssistant, niit:isAssisting, niit:Course >
The prefixes nust, niit, eme, and nims refer to URLs http://www.nust.edu.pk,
http://www.niit.edu.pk, http://www.nims.edu.pk, and http://www.eme.edu.pk respectively.
Once the local ontologies have been created, the index management service comes into play: it
creates the bitmap segments in the bitmap index for the data sources and plots (synchronizes)
the RDF triples of the data sources in their respective bitmap segments. During synchronization,
the index management service also resolves the semantic heterogeneities. The structure of the
bitmap index is illustrated in Table 5.
Source-segment  nust-1000001  nust-1000002  nust-1000003  nust-1000004  nust-1000005  nust-1000006
EME-DB          1             1             1             0             0             0
NIMS-DB         0             1             0             1             1             0
NIIT-DB         1             0             1             0             0             1
Suppose a user query contains the RDF triple <Instructor, isTeaching, Course>.
The relevance reasoning service decomposes this triple into its terms and creates three
buckets: one for the subject, one for the property, and one for the object. Each term is
given to the ontology reasoning service, which calculates its semantic similarity in the
respective hierarchy to find relevant terms. The buckets are populated as shown in Table 6.
Table 4 RDF triples of the data sources
Table 5 Structure of Bitmap Index after sources are registered
Table 6 Buckets created for the RDF triples
Semantic Operator Used  Subject Bucket for "Instructor"     Property Bucket for "isTeaching"  Object Bucket for "Course"
exactMatch              Instructor                          isTeaching                        Course
sameAs                  NULL                                Teaching, Teaches                 Subject
subClassOf              Professor, Prof, Lecturer, Teacher  NULL                              NULL
equivalentOf            TeachingAssistant                   isAssisting                       NULL
The Cartesian product of the subject, property, and object buckets is taken to construct
the inferred triple list. Table 7 shows this Cartesian product.
Expansion of RDF triple using Ontology Reasoning Service
<Instructor>, <isTeaching>, <Course>
<Instructor>, <Teaching>, <Course>
<Instructor>, <Teaches>, <Course>
<Instructor>, <isAssisting>, <Course>
...
<Instructor>, <isAssisting>, <Subject>
<Professor>, <isTeaching>, <Course>
<Professor>, <Teaching>, <Course>
<Professor>, <Teaches>, <Course>
...
<Professor>, <isAssisting>, <Subject>
<Prof>, <isTeaching>, <Course>
<Prof>, <Teaching>, <Course>
<Prof>, <isAssisting>, <Subject>
<Lecturer>, <isTeaching>, <Course>
<Lecturer>, <Teaching>, <Course>
<Lecturer>, <Teaches>, <Course>
...
<Teacher>, <Teaching>, <Course>
<Teacher>, <Teaches>, <Course>
...
<Teacher>, <isAssisting>, <Subject>
<TeachingAssistant>, <isTeaching>, <Course>
<TeachingAssistant>, <Teaching>, <Course>
<TeachingAssistant>, <Teaches>, <Course>
<TeachingAssistant>, <isAssisting>, <Course>
...
<TeachingAssistant>, <isAssisting>, <Subject>
Table 7 Inferred RDF triples for a user's query triple

In order to execute a query over the bitmap index, GUIDs are needed. An RDF triple
is rejected if no GUID is available for it in the global ontology. In this example, the
GUIDs nust-1000001 and nust-1000006 are fetched from the global ontology. These GUIDs are
passed to the index lookup service to identify relevant and effective data sources. The
index lookup service traverses the bitmap index for only these GUIDs and returns all
bitmap segments where the bits are set, i.e., EME-DB and NIIT-DB.
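Using the bitmap segments of Table 5, this lookup can be reproduced in a few lines. The sketch below is an illustrative Python model, not the thesis's PL/SQL implementation.

```python
# Source selection over the example bitmap index of Table 5.
# Positions 0..5 correspond to GUIDs nust-1000001..nust-1000006.

SEGMENTS = {
    "EME-DB":  [1, 1, 1, 0, 0, 0],
    "NIMS-DB": [0, 1, 0, 1, 1, 0],
    "NIIT-DB": [1, 0, 1, 0, 0, 1],
}
POSITIONS = {"nust-1000001": 0, "nust-1000006": 5}

def select_sources(guids):
    """Collect every source whose bit is set at any queried GUID position."""
    relevant = set()
    for guid in guids:
        pos = POSITIONS[guid]
        for source, bits in SEGMENTS.items():
            if bits[pos] == 1:
                relevant.add(source)
    return relevant

sources = select_sources(["nust-1000001", "nust-1000006"])
```

With the query GUIDs of the case study, the lookup yields EME-DB and NIIT-DB, matching the text.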
To sort the data sources by their relevance to the query triples, semantic
similarity scoring is applied as shown in Table 1. First, term similarity is
computed for the query triples against the data source triples using the concept and
relationship hierarchies.
EME-DB scores 0.6 for matching the subject of the query triple, Instructor, with the
subject of the source triple, Professor; the concept hierarchy returns a subClassOf
relationship between these terms. Next, the properties of the query and source triples are
matched, scoring 0.8 for isTeaching and Teaches, because they are connected by a sameAs
relationship. Finally, the objects of the query and source triples are matched, scoring
0.8 for Course and Subject.
NIIT-DB scores 0.6 for matching the subject of the query triple, Instructor, with the
subject of the source triple, Lecturer; the concept hierarchy returns a subClassOf
relationship for this match. The data source scores 1 for matching the property
isTeaching with the query property isTeaching, and 1 again for matching the objects
Course and Course. NIIT-DB also contains a triple that is relevant to the query
triple with some degree of likelihood, i.e., nust-1000006.
The relevance of a data source for every query triple is calculated by putting the term
similarity scores into the equation 1 and is shown in Table 8.
Table 8: Semantic Similarity Calculation of a Data Source for a User Query Triple

Relevant      GUIDs         Term Similarity                             Source Similarity for
Data Source                 sim(subject)  sim(property)  sim(object)    Query Triple (qT)
EME-DB        nust-1000001  0.6           0.8            0.8            0.384
NIIT-DB       nust-1000001  0.6           1              1              0.6
              nust-1000006  0.5           0.5            1              0.25
Finally, the overall similarity score of a data source for a user's query is calculated
using equation 2 and is shown in Table 9. The sources are sorted and given to the query
rewriting component.
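Equations 1 and 2 are not reproduced here, but the worked values in Tables 8 and 9 are consistent with multiplying the three term similarities per triple and summing the triple scores per source. The Python sketch below works under that assumption; the function names are ours, not the thesis's.

```python
from math import prod

# Reproduces the worked scores of Tables 8 and 9, assuming triple
# similarity = product of term similarities (equation 1) and source
# similarity = sum of triple similarities (equation 2).

def triple_similarity(sim_subject, sim_property, sim_object):
    return prod([sim_subject, sim_property, sim_object])

def source_similarity(triple_scores):
    return sum(triple_scores)

eme = triple_similarity(0.6, 0.8, 0.8)              # nust-1000001
niit = [triple_similarity(0.6, 1, 1),               # nust-1000001
        triple_similarity(0.5, 0.5, 1)]             # nust-1000006

ranking = sorted(
    {"EME-DB": source_similarity([eme]),
     "NIIT-DB": source_similarity(niit)}.items(),
    key=lambda kv: kv[1], reverse=True)
```

Under this reading, NIIT-DB (0.85) ranks above EME-DB (0.384), matching Table 9.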
Relevant      Source Similarity for     Total Source Similarity
Data Source   Query Triple (qT)         for User Query
EME-DB        nust-1000001: 0.384       simEME = 0.384
NIIT-DB       nust-1000001: 0.6         simNIIT = 0.85
              nust-1000006: 0.25
In a nutshell, we have explained our proposed architecture of relevance reasoning for
source selection in data integration. Different workflows are highlighted and semantic
matching methodology has been explained using a case study.
Table 9: Semantic Similarity Calculation of a Data Source for User Query
CHAPTER 5

IMPLEMENTATION
This chapter discusses our implementation strategy and the issues encountered for the
proposed architecture. The first section discusses in detail the Oracle
implementation of the ontologies and RDF data; the second section discusses the
implementation details of our proposed architecture for relevance reasoning.
5.1 RDF Data and Ontologies in the Oracle Database
In Oracle Database 10g Release 2, a new data model was introduced for storing
RDF and OWL data. This functionality builds on the Oracle Spatial Network Data
Model (NDM), which is the Oracle solution for managing graphs within the Oracle
Database. The RDF Data Model supports three types of database objects: model or
ontology (an RDF graph consisting of a set of triples), rule-base (a set of rules), and
rule index (an entailed RDF graph).
5.1.1. RDF Data Model or Ontology: There is a single universe for all RDF data stored
in the database. All RDF triples are parsed and stored in the system under the MDSYS
schema as shown in Figure 19. An RDF triple (subject, predicate, and object) is treated as
one database object. A single RDF document that contains multiple triples, therefore,
results in many database objects.
RDF_MODEL$ is a system level table created to store information on all of the RDF
and OWL ontologies in a database. Whenever a new ontology is created, new
MODEL_ID is automatically generated for it. An entry is made into the RDF_MODEL$
table.
2 http://www.oracle.com/index.html
The RDF_NODE$ table stores the VALUE_ID for text values that participate in
subjects or objects of statements. The NODE_ID is the same as the VALUE_ID.
NODE_ID values are stored once, regardless of the number of subjects or objects they
participate in. The node table allows RDF data to be exposed to all of the analytical
functions and APIs available in the core NDM.
The RDF_LINK$ table stores the triples for all of the RDF models in the database;
the MODEL_ID logically partitions the RDF_LINK$ table. Selecting all of
the links for a specified MODEL_ID returns the RDF network for that particular
ontology.
The RDF_VALUE$ table stores the text values, i.e. the Uniform Resource Identifiers
or literals for each part of the triple. Each text value is stored only once, and a unique
VALUE_ID is generated for the text entry. URIs, blank nodes, plain literals and typed
literals are all possible VALUE_TYPE entries.
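The VALUE_ID scheme described above amounts to interning: each distinct text value is stored once and referenced by ID thereafter. A minimal Python sketch of the idea follows; the class and names are hypothetical, not Oracle's.

```python
# Hypothetical sketch of value interning, modelling the idea behind the
# RDF_VALUE$ table: each distinct text value is stored once with a
# unique VALUE_ID.

class ValueTable:
    def __init__(self):
        self._ids = {}      # text value -> VALUE_ID
        self._values = []   # VALUE_ID -> text value

    def intern(self, text):
        """Return the existing VALUE_ID for `text`, or assign a new one."""
        if text not in self._ids:
            self._ids[text] = len(self._values)
            self._values.append(text)
        return self._ids[text]

values = ValueTable()
a = values.intern("http://www.nust.edu.pk/Instructor")
b = values.intern("http://www.nust.edu.pk/Instructor")  # stored only once
```

Repeated inserts of the same value return the same ID, regardless of how many triples reference it.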
Figure 19 Database Schema to store ontology in Oracle NDM
Blank nodes are used to represent unknown objects, and when the relationship
between a subject node and an object node is n-ary. New blank nodes are automatically
generated whenever blank nodes are encountered in triples. However, it is possible for
users to re-use blank nodes, for example when inserting data into containers or
collections. The RDF_BLANK_NODE$ table stores the original names of blank nodes
that are to be reused when encountered in triples.
To represent a reified statement a resource is created using the LINK_ID of the triple.
The resource can then be used as the subject or object of a statement. To process a
reification statement, a triple is first entered with the reified statement’s resource as
subject, rdf:type as property and rdf:Statement as object. A triple is then entered for each
assertion about the reified statement. However, each reified statement will have only one
rdf:type to rdf:Statement associated with it, regardless of the number of assertions made
using this resource.
The Oracle RDF Data Model supports containers and collections. A container or
collection will have an rdf:type to rdf:container_name or rdf:collection_name associated
with it, and a LINK_TYPE of RDF_MEMBER.
Two new object types have been defined for RDF-modeled data. SDO_RDF_TRIPLE
serves as the triple representation of RDF data, whilst SDO_RDF_TRIPLE_S is defined
to store persistent data in the database. The GET_RDF_TRIPLE() function can be used to
return an SDO_RDF_TRIPLE type.
5.1.2. Rule-base: Oracle supplies both an RDF rule-base that implements the RDF
entailment rules, and an RDF Schema (RDFS) rule-base that implements the RDFS
entailment rules. Both rule-bases are automatically created when RDF support is added to
the database. It is also possible to create a user-defined rule-base for additional
specialized inference capabilities. For each rule-base, a system table is created to hold
rules in the rule-base, along with a system view of the rule-base. The view is used to
insert, delete and modify rules in the rule-base. Information about all rule-bases is
maintained in the rule-base information view.
For example, the rule that the head of department (HoD) is also a faculty member of
the department could be represented as follows:
('HeadofDepartRule',                -- rule name
 '(?p :HoDOf ?d)',                  -- IF side pattern
 NULL,                              -- filter condition
 '(?p :FacultyMemberOf ?d)',        -- THEN side pattern
 SDO_RDF_Aliases(MDSYS.RDF_Alias('', 'http://www.seecs.edu.pk/univontology/')))
In this case, the rule has no filter condition, so that component of the
representation is NULL. Note that a THEN side pattern with more than one triple can be
used to infer multiple triples for each IF side match.
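The effect of HeadofDepartRule can be illustrated outside the database: every triple matching the IF-side pattern entails the corresponding THEN-side triple. The Python sketch below is our illustration (including the sample individual "Ali"), not part of the thesis.

```python
# Illustrative rule application: each (?p :HoDOf ?d) triple entails
# (?p :FacultyMemberOf ?d), as in HeadofDepartRule above.

def apply_rule(triples, if_property, then_property):
    """Infer a THEN-side triple for every triple matching the IF side."""
    inferred = set()
    for s, p, o in triples:
        if p == if_property:
            inferred.add((s, then_property, o))
    return inferred

# "Ali" and "SEECS" are hypothetical sample data.
asserted = {("Ali", ":HoDOf", "SEECS"), ("Ali", ":isTeaching", "Course")}
entailed = apply_rule(asserted, ":HoDOf", ":FacultyMemberOf")
```

Only the triple matching the IF side produces an entailment; unrelated triples are left alone.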
5.1.3. Rules Index: A rules index is an object containing pre-computed triples that can
be inferred from applying a specified set of rule-bases to a specified set of ontologies. If a
graph query refers to any rule-bases, a rule index must exist for each rule-base and
ontology combination in the query.
When a rule index is created, a view is also created of the RDF triples associated with
the index under the MDSYS schema. This view is visible only to the owner of the rules
index and to users with suitable privileges. Information about all rule indexes is
maintained in the rule index information view. Information about all database objects,
such as ontologies and rule-bases, related to rules indexes is maintained in the Rule Index
Datasets view.
5.1.4. Querying RDF Data: The SDO_RDF_MATCH function has been designed to
meet most of the requirements identified by W3C in SPARQL for graph querying. A Java
API is also provided for network representation and network analysis. Analysis
capabilities include the ability to find a path between two resources, or to find a path
between two resources when the links are of a specified type.
Use of the SDO_RDF_MATCH table function allows a graph query to be embedded
in a SQL query. It has the ability to search for an arbitrary pattern against the RDF data,
including inference, based on RDF, RDFS, and user-defined rules. It can automatically
resolve multiple representations of the same point in value space (e.g., "10"^^xsd:Integer
and "10"^^xsd:PositiveInteger).
5.2 Setting up the Stage for Implementation
The implementation of different components of the architecture is discussed in the
following subsections.
5.2.1. Enabling and Disabling RDF Support in the Database: Before using RDF
support in an Oracle database, we need to enable this feature. A procedure named
CREATE_RDF_NETWORK() of the SDO_RDF package is used to enable RDF support
in the database. This procedure creates the system tables and other database objects used
for RDF support. One must connect to the database as a user with DBA privileges in
order to call this procedure, and should call it only once for the database. To remove
RDF support from the database, call the SDO_RDF.DROP_RDF_NETWORK
procedure. The following example enables RDF support in the database.
Enabling the Semantic Network

BEGIN
  SDO_RDF.CREATE_RDF_NETWORK('rdf_tblspace');
END;
5.2.2. Creating the Global Ontology: The table used to store the RDF triples of the
global ontology, named GLOBAL_RDF_DATA, is shown below.

Column Name  Data type         Description
GUID         NUMBER            GUID assigned to an incoming RDF triple of the global ontology.
TRIPLE       SDO_RDF_TRIPLE_S  Stores the subject, predicate, and object of the RDF triple.
TRIPLE_TYP   VARCHAR2          Distinguishes whether the RDF triple is a rule-base (R) or metadata (M) triple.
BIT_POS      NUMBER            If the RDF triple type is M, stores the position of the GUID over the bitmap index.
A unique sequence generating object is used to assign GUIDs to the incoming RDF
triples. The example below shows the creation of the sequence generator object.
Creating the Sequence Generator for GUIDs

CREATE SEQUENCE s_global_rdf_data_id
  START WITH 1000
  INCREMENT BY 1
  NOCACHE
  ORDER;
Once the global ontology table has been created, we then create the global
ontology using the CREATE_RDF_MODEL() procedure of the SDO_RDF package.
The example below creates the global ontology.
Creating the Global Ontology
BEGIN
  SDO_RDF.CREATE_RDF_MODEL('global_ontology', 'global_rdf_data', 'triple');
END;
This procedure adds the global ontology to the MDSYS.RDF_MODEL$ table. To
delete ontology, use the SDO_RDF.DROP_RDF_MODEL procedure.
5.2.3. Creating the Bitmap Index: The table used to store the bitmap segments, named
BITMAP_INDX, is shown below.
Column Name     Data type  Description
SEGMENT_ID      NUMBER     A unique identifier assigned to the bitmap segment created for an incoming data source.
SEGMENT_SOURCE  URI        Stores the URI of the data source.
BITMAP_PATTERN  VARCHAR2   Stores the bits representing the RDF triples of a data source.
A unique sequence generating object is created to assign segment identifiers to newly
created bitmap segments. The example below shows the creation of the sequence generator
object.
Creating Sequence Generator for Bitmap Segments
CREATE SEQUENCE s_bitmap_segment_id
  START WITH 1000
  INCREMENT BY 1
  NOCACHE
  ORDER;
5.2.4. Defining Semantic Operators and Creating Hierarchies: The semantic
operators exactMatch, sameAs, equivalentOf, and subClassOf are defined
over the global ontology. The following example shows the SQL that defines the sameAs
operator; the same syntax is used to define the other operators.
Defining the sameAs Operator

INSERT INTO global_ontology_rdf_data VALUES (
  s_global_rdf_data_id.NEXTVAL,
  SDO_RDF_TRIPLE_S('global_ontology',
    'http://www.niit.edu.pk/Research/Delsa/sameAs',
    'http://www.w3.org/1999/02/22-rdf-syntax-ns#type',
    'http://www.w3.org/1999/02/22-rdf-syntax-ns#Property'));
Once the semantic operators have been defined, they are used to manage the
concept and relationship hierarchies. The code in the following example links the
concept Course with Subject using the sameAs operator to represent synonyms.
Managing Hierarchies
INSERT INTO global_ontology_rdf_data VALUES (
  s_global_rdf_data_id.NEXTVAL,
  SDO_RDF_TRIPLE_S('global_ontology',
    'http://www.niit.edu.pk/Research/Delsa/Course',
    'http://www.niit.edu.pk/Research/Delsa/sameAs',
    'http://www.niit.edu.pk/Research/Delsa/Subject'));
5.2.5. Creating Rules, Rule-base and Rule Index: To create a user-defined
rule-base, the CREATE_RULEBASE() procedure of the SDO_RDF_INFERENCE package is used. The
following example creates a rule-base named global_ontology_rb for the global ontology.

Creating the Global Ontology Rule-base

BEGIN
  SDO_RDF_INFERENCE.CREATE_RULEBASE('global_ontology_rb');
END;
After creating the rule-base, rules can be added to it. To cause the rules in the
rule-base to be applied in a query of RDF data, one can specify the rule-base in the
call to the SDO_RDF_MATCH table function. Inverse and transitive rules have
been inserted for each semantic operator. The following example explains the
implementation of these rules for sameAs operator.
Inverse Rule for the sameAs Operator

INSERT INTO mdsys.rdfr_global_ontology_rb VALUES (
  'InverseOfSameAs',
  '(?x :sameAs ?y)', NULL, '(?y :sameAs ?x)',
  SDO_RDF_ALIASES(SDO_RDF_ALIAS('', 'http://www.niit.edu.pk/Research/Delsa/')));

Transitive Rule for the sameAs Operator

INSERT INTO mdsys.rdfr_global_ontology_rb VALUES (
  'TransitiveOfSameAs',
  '(?x :sameAs ?y) (?y :sameAs ?z)', NULL, '(?x :sameAs ?z)',
  SDO_RDF_ALIASES(SDO_RDF_ALIAS('', 'http://www.niit.edu.pk/Research/Delsa/')));
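Together, the inverse and transitive rules make sameAs a symmetric, transitive relation; the entailed pairs can be modelled as a closure computation. A Python sketch follows; the pair ("Subject", "Module") is a hypothetical example.

```python
# Symmetric, transitive closure of sameAs, modelling the combined effect
# of the inverse and transitive rules above.

def same_as_closure(pairs):
    closure = set(pairs)
    closure |= {(y, x) for x, y in closure}   # inverse rule
    changed = True
    while changed:                            # transitive rule, to fixpoint
        changed = False
        new = {(x, z)
               for x, y1 in closure
               for y2, z in closure
               if y1 == y2 and x != z}
        if not new <= closure:
            closure |= new
            changed = True
    return closure

# ("Subject", "Module") is hypothetical, added to show transitivity.
pairs = {("Course", "Subject"), ("Subject", "Module")}
closure = same_as_closure(pairs)
```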
Whenever rules are inserted, updated, or deleted in the rule-base, the rules index must
be refreshed. The following example creates the rule index for the global ontology rule-
base.
Rules Index Creation

BEGIN
  SDO_RDF_INFERENCE.CREATE_RULES_INDEX(
    'rdfs_rix_global_ontology',
    SDO_RDF_Models('global_ontology'),
    SDO_RDF_Rulebases('RDFS', 'global_ontology_rb'));
END;
5.3 Implementation of the Proposed Architecture for Relevance Reasoning
Figure 20 shows the package diagram of the proposed architecture for relevance
reasoning in a scalable data integration system. The remainder of this section discusses
the functionality provided by each of these packages, along with a brief description.
5.3.1. PACKAGE Source_Registration_Service: This package manages local
ontologies for the incoming data sources. It provides two procedures for this purpose.

5.3.1.1. REGISTER_SOURCE(): This procedure accepts the name and contents of an
incoming data source and creates the local ontology for it in the source description storage.

Parameter Name     Data type       Description
p_incoming_source  VARCHAR2        Name of the incoming data source. This name must be unique.
p_list_of_triples  TRIPLE_TAB_TYP  List of triples expressing the contents and capabilities of the incoming data source.
Figure 20 Package Diagram of the Proposed Architecture for Relevance Reasoning
5.3.1.2. UNREGISTER_SOURCE(): This procedure accepts the name of a data source
and deletes its local ontology from the source description storage.

Parameter Name     Data type  Description
p_deleting_source  VARCHAR2   Name of the data source to be deleted. This name must be unique.
5.3.2. PACKAGE Ontology_Management_Service: This package manages the global
ontology. It provides three main procedures to perform various tasks.

5.3.2.1. REGISTER_GLOBAL_TRIPLE(): This procedure helps in publishing domain
knowledge in terms of RDF triples. It assigns a GUID to the incoming triple, reserves its
position on the bitmap index, and adds it to the global ontology.

Parameter Name          Data type       Description
p_incoming_triple       SDO_RDF_TRIPLE  RDF triple describing the domain knowledge.
p_incoming_triple_type  VARCHAR2        Type of the RDF triple.
5.3.2.2. RECONCILE_GUID(): This function returns the GUID for the specified RDF triple.
It interacts with the ontology reasoning service to semantically expand the RDF triple and
identify its GUID.

Parameter Name     Data type       Description
p_incoming_triple  SDO_RDF_TRIPLE  RDF triple for which the GUID has to be identified.
5.3.2.3. IDENTIFY_BITMAP_POSITION(): This function accepts a GUID and returns the
bitmap position of the corresponding RDF triple.

Parameter Name          Data type  Description
p_incoming_triple_GUID  NUMBER     GUID of the RDF triple for which the bitmap position has to be identified.
5.3.3. PACKAGE Index_Management_Service: This package manages the bitmap
index in the proposed architecture. Following are its three main procedures.
5.3.3.1. MANAGE_BITMAP_PATTERN(): This procedure manages the bitmap pattern for
the index whenever domain knowledge is published in terms of RDF triples.

Parameter Name          Data type  Description
p_incoming_triple_GUID  NUMBER     GUID of the RDF triple to be published in the global ontology.
5.3.3.2. CONSTRUCT_BITMAP_SEGMENT(): This procedure constructs the bitmap
segment for an incoming data source. It assigns a unique identifier to each bitmap
segment; initially, all bits in the bitmap pattern are set to 0.

Parameter Name     Data type  Description
p_incoming_source  VARCHAR2   URI of the incoming data source for which the bitmap segment has to be created.
5.3.3.3. SYNCH_BITMAP_SEGMENT(): This procedure synchronizes the local
ontology RDF triples with the bitmap segment of a specified data source. It shuffles the
bits according to the RDF triples of the local ontology.

Parameter Name    Data type  Description
p_source_segment  VARCHAR2   Unique identifier assigned to the bitmap segment of the data source.
GUID_POS          NUMBER     Position of the bit on the bitmap segment that needs to be shuffled.
BIT_STATE         VARCHAR2   SET means 1, and UNSET means 0.
5.3.4. PACKAGE Index_Lookup_Service: This package traverses the bitmap
segments in the index for a specified RDF triple. It contains the single function shown below.

5.3.4.1. TRAVERSE_BITMAP_SEGMENT(): This function accepts a position
and traverses the bitmap index at that position to identify the bitmap
segments where the bits are set.

Parameter Name  Data type  Description
GUID_POS        NUMBER     Position of the bit on the bitmap segments that needs to be traversed.
5.3.5. PACKAGE Ontology_Reasoning_Service: This package enables the architecture
to perform ontological inferencing and to calculate the semantic similarity among different
terms. It contains the following functions.

5.3.5.1. GENERATE_SEMANTIC_QUERY(): This function provides the simple semantic
searching behaviour of the proposed architecture. It accepts a term (concept or
relationship) and formulates a semantic query that checks for synonyms, lexical variants,
and subclass operators in their respective hierarchies over the global ontology.

Parameter Name   Data type  Description
P_incoming_term  VARCHAR2   Term for which the simple semantic query has to be generated.

5.3.5.2. GENERATE_SEMANTIC_QUERY_DOL(): This function extends the simple
semantic searching behaviour of the proposed architecture. It formulates a semantic query
that, in addition to synonyms, lexical variants, and subclass operators, checks for terms
that are relevant with some degree of likelihood.

Parameter Name   Data type  Description
P_incoming_term  VARCHAR2   Term for which the extended semantic query has to be generated.
5.3.5.3. FETCH_RELEVANT_TERMS(): This function executes the query generated by the
GENERATE_SEMANTIC_QUERY() function and returns a list of relevant terms for
the term being reasoned over.

Parameter Name   Data type  Description
P_incoming_term  VARCHAR2   Term for which semantic similarity has to be computed.
5.3.5.4. FETCH_RELEVANT_TERMS_DOL(): This function executes the query generated
by the GENERATE_SEMANTIC_QUERY_DOL() function and returns a list of relevant
terms for the term being reasoned over.
5.3.6. PACKAGE Relevance_Reasoning_Service: This package accepts the RDF
triples of a user query and identifies the most effective and relevant data sources.
5.3.6.1. IDENTIFY_RELEVANT_SOURCES(): This function interacts with the ontology
reasoning service, drawing inferences from it to expand the query triples. It also interacts
with the index lookup service to identify the most effective and relevant data sources for
the inferred RDF triples.
Parameter Name Data type Description
p_incoming_subject VARCHAR2 Subject of the query RDF triples
p_incoming_property VARCHAR2 Property of the query RDF triples
p_incoming_object VARCHAR2 Object of the query RDF triples
5.3.6.2. IDENTIFY_RELEVANT_SOURCES_DOL(): This function interacts
with the ontology reasoning service, drawing inferences based on degree of likelihood
to expand the query triples. It also interacts with the index lookup service to
identify the most effective data sources that are relevant with a certain degree of
likelihood.
5.3.6.3. RANK_RELEVANT_SOURCE(): This function ranks the selected data sources
based on the score obtained for the user's query.
Parameter Name Data type Description
p_incoming_source VARCHAR2 Relevant data source that are to be ranked
p_ranking_order VARCHAR2 DESC/ASC means descending/ascending
We have highlighted the Oracle implementation of the ontologies and RDF data, and
discussed in detail the design and implementation of the proposed architecture along with
the issues involved.
CHAPTER 6

RESULTS AND EVALUATION
In this chapter we evaluate the results of the prototype system developed in
Chapter 5. We identify the main evaluation criteria, the details of the data set, the query
structure, the system specification, and the results of the experiments carried out with the system.
6.1 System Specification
Processor          Pentium-IV 2.4 GHz
RAM                1 GB
HDD                80 GB
Operating System   Windows 2003 (with Service Pack 2)
Tool               Oracle Spatial 10g Release 2 NDM
Language           PL/SQL
6.2 Evaluation Criteria

The main aim of this evaluation is to validate whether the proposed architecture for
relevance reasoning can scale to a large number of data sources and complex
queries. To quantitatively measure the performance of the relevance reasoning,
different evaluation measures have been used, which are discussed in the subsequent
section. The evaluation criteria for our system are listed below:
6.2.1. Response Time of Query Execution: to ensure that the manipulation of RDF
triples does not degrade query response time during relevance reasoning as the number of
sources in the system increases.
6.2.2. Accuracy of the Relevant Source Selection: to ensure that the provision of
semantics does not affect the accuracy of the proposed methodology; this can be checked
by calculating the precision and recall of the system for relevance reasoning. Precision
can be defined as the ratio of the relevant data sources retrieved to the total number of
retrieved data sources [41]:

Precision = |{relevant sources} ∩ {retrieved sources}| / |{retrieved sources}|

whereas recall can be defined as the proportion of the relevant data sources that are
retrieved [41]:

Recall = |{relevant sources} ∩ {retrieved sources}| / |{relevant sources}|
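These two measures can be computed with a small Python sketch (illustrative only; the numbers below are hypothetical and not taken from our experiments):

```python
def precision_recall(relevant, retrieved):
    """Precision = |relevant & retrieved| / |retrieved|,
    Recall    = |relevant & retrieved| / |relevant|."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 4 of the 5 retrieved sources are relevant; 8 sources are relevant in total
p, r = precision_recall(relevant={1, 2, 3, 4, 5, 6, 7, 8}, retrieved={1, 2, 3, 4, 99})
print(p, r)  # 0.8 0.5
```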
6.3. Data Specification
The experiment has been carried out with a corpus of 100 manually generated data
sources. Each data source contains 30-50 RDF triples. The well-known university
ontology has been used as the domain ontology in the experiment [1, 42].
6.4. Test Queries
We have executed 35 different queries related to students, faculty, and research
associates. We performed the accuracy test of the proposed architecture over these test
queries and comparatively analyzed our system against the MiniCon algorithm [1],
observing the precision and recall of both systems. Among these 35 queries, we selected
3 queries, having 3, 6, and 9 RDF triples respectively, to test the system efficiency by
checking the query response time. These queries are given below:
Find the names of all instructors who are teaching a course to the same student to
whom they are advisors.

RDF Pattern of Query 1
(?instructor :isTeaching :Course) (?student :isRegisteredIn :Course) (?instructor :isAdvisorOf ?student)
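Evaluating such a triple pattern over a set of RDF triples amounts to a join on the shared variables. A minimal Python sketch (the triples are hypothetical, and for illustration the course is treated as a shared variable rather than a constant):

```python
# Hypothetical triple store for the university domain
triples = [
    ("ali",  "isTeaching",     "CS501"),
    ("sara", "isRegisteredIn", "CS501"),
    ("ali",  "isAdvisorOf",    "sara"),
    ("omar", "isTeaching",     "CS502"),
]

def match(pattern, triple, bindings):
    """Try to unify one pattern with one triple under existing bindings."""
    b = dict(bindings)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):            # variable term: bind or check
            if b.get(p, t) != t:
                return None
            b[p] = t
        elif p != t:                     # constant term must match exactly
            return None
    return b

def evaluate(patterns, triples):
    """Join the patterns left to right, as in the RDF pattern of Query 1."""
    results = [{}]
    for pat in patterns:
        results = [b2 for b in results for t in triples
                   if (b2 := match(pat, t, b)) is not None]
    return results

query1 = [("?instructor", "isTeaching", "?course"),
          ("?student", "isRegisteredIn", "?course"),
          ("?instructor", "isAdvisorOf", "?student")]
print(evaluate(query1, triples))
```

Only the instructor who both teaches the course and advises the registered student survives all three joins.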
Find the instructor name, instructor gender, and area of specialization of all
instructors, whether they are staff or students.
Find the instructor name, instructor gender, and area of specialization of all
instructors, whether they are staff or students, where the student's major department is
not the same as the advisor's working department.
6.5. Experiments for Response Time of Query Execution
In the experiment for evaluating the performance, we assessed the query response
time of the system along three dimensions. First, queries were executed against the local
ontologies of the data sources in the source description storage, and we assessed the time
taken by the relevance reasoner to traverse the local ontologies for relevant source
selection. Second, as our proposed methodology employs a bitmap index in which source
descriptions are semantically mapped as bits in the bitmap segments, we submitted the
queries to the relevance reasoner using the bitmap index and assessed the time taken.
Finally, we extended the bitmap index, implemented function-based indexing over it,
and then analyzed the performance of the system. Figures 21, 22, and 23 illustrate the
performance of the system with the 3 queries shown in the preceding section.

RDF Pattern of Query 2
(?instructor :hasName ?name) (?instructor :hasGender ?gender) (?instructor :hasArea ?area) UNION (?student :isAssisting :Course) (?student :hasGender ?gender) (?student :hasMajor ?depart)

RDF Pattern of Query 3
((?instructor :hasName ?name) (?instructor :hasGender ?gender) (?instructor :hasArea ?area) UNION (?student :isAssisting ?Course) (?student :hasGender ?gender) (?student :hasMajor ?depart)) MINUS (?instructor :isAdvisorOf ?student) (?student :hasMajor ?depart) (?instructor :hasWorkingDepart ?depart)
Figure 21 Time Complexity of System (Query with 3 Triples)
Figure 22 Time Complexity of System (Query with 6 Triples)
Figure 23 Time Complexity of System (Query with 9 Triples)
The observations showed a performance gain when running queries on source
descriptions through the bitmap index compared with running them directly against the
source descriptions. A further, significant performance gain was observed when
searching relevant sources using the extended bitmap index, compared with both
previously discussed approaches. Figure 24 shows the performance gain of the extended
bitmap index over the simple bitmap index.
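The bitmap lookup behind this gain can be sketched in Python; the predicate vocabulary, source names, and bit assignments below are hypothetical placeholders, while the actual prototype uses Oracle bitmap indexes over the source description storage:

```python
# Hypothetical vocabulary of predicates; each source description is
# mapped into the bitmap as one bit per predicate it can answer.
patterns = ["isTeaching", "isRegisteredIn", "isAdvisorOf", "hasMajor"]
bit = {p: 1 << i for i, p in enumerate(patterns)}

sources = {
    "src_faculty":  bit["isTeaching"] | bit["isAdvisorOf"],
    "src_students": bit["isRegisteredIn"] | bit["hasMajor"],
    "src_registry": bit["isTeaching"] | bit["isRegisteredIn"] | bit["isAdvisorOf"],
}

def relevant_sources(query_predicates):
    """A source is relevant when its bitmap covers every query predicate."""
    mask = 0
    for p in query_predicates:
        mask |= bit[p]          # build the query's bit mask
    return [s for s, bits in sources.items() if bits & mask == mask]

print(relevant_sources(["isTeaching", "isAdvisorOf"]))
```

The selection reduces to one bitwise AND per source, which is why it is faster than traversing each local ontology at query time.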
Figure 24 Performance gain of the system with respect to direct ontology traversal
6.6. Experiments for System Accuracy
In the experiment for evaluating the accuracy of the system, we calculated the
precision and recall of our proposed methodology and compared them with the
MiniCon algorithm [1]. As the MiniCon algorithm directly traverses the source
descriptions, we did not implement it as-is; rather, we developed code following the same
approach to traverse the local ontologies. Because our proposed semantic matching
process also searches for synonyms, lexical variants, subclasses, and the degree of
likelihood, the comparison showed an increase in both precision and recall with respect
to the MiniCon algorithm.
Figure 25: Precision vs. Recall comparison of the proposed methodology with the MiniCon algorithm
We have provided an evaluation of the results of the developed prototype system in
this chapter. Different evaluation criteria were identified for the system evaluation, and
we compared the results of the prototype system with those of existing systems. The
comparison showed that the system has a better query response time and accuracy of
source selection than the existing systems.
CHAPTER 7
CONCLUSION AND FUTURE DIRECTIONS
In this chapter we conclude the research thesis. It provides an analysis of the results
and the future directions in which the thesis work can be extended. The chapter is of vital
importance because it provides a bird's-eye view of the methodology and gives future
directions for new researchers.
7.1. Discussion
The exponential growth in online data sources due to advancements in information
and communication technologies (ICT) requires semantically enabled, robust, and
scalable data integration. Keeping these objectives in view, we have proposed an
ontology-driven relevance reasoning architecture that identifies the most effective and
relevant data sources for a user's query before executing it. In our proposed
methodology, we mapped the local ontologies of the data sources onto a bitmap index;
instead of traversing the local ontologies during relevance reasoning, we use the bitmap
index to perform the relevance reasoning.
The proposed methodology has three workflows: (1) the Ontology Management
Workflow, (2) the Source Registration Workflow, and (3) the Query Execution
Workflow. This division helps to understand the functionality of the various components
in the methodology along with their inter-dependence. The ontology management
workflow and the source registration workflow set the stage for relevance reasoning in
the proposed architecture.
The ontology management workflow publishes the domain knowledge in the form of
RDF in the global ontology. It creates the concept and relationship hierarchies using the
semantic operators. It also creates the rule-base to define rules and manages the rules
index to perform inference and reasoning during the semantic matching process. The
source registration workflow manages the local ontologies of the data sources in the
source description storage. As new sources enter and leave the system, the index
management service synchronizes the bitmap index to reflect the new status of the source
description storage. In order to answer queries precisely, the bitmap index needs to be
kept synchronized with the source description storage.
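This synchronization can be illustrated with a toy Python sketch; the class, predicate names, and sources are hypothetical, and the prototype itself performs the equivalent maintenance on Oracle bitmap indexes:

```python
class BitmapIndex:
    """Toy index-management service: keeps the bitmap in step with the
    source description storage as sources enter and leave the system."""

    def __init__(self, predicates):
        self.bit = {p: 1 << i for i, p in enumerate(predicates)}
        self.rows = {}                  # source name -> bitmap row

    def register(self, source, predicates):
        mask = 0
        for p in predicates:
            mask |= self.bit[p]
        self.rows[source] = mask        # synchronize on registration

    def unregister(self, source):
        self.rows.pop(source, None)     # synchronize on departure

idx = BitmapIndex(["isTeaching", "isAdvisorOf", "hasMajor"])
idx.register("src_faculty", ["isTeaching", "isAdvisorOf"])
idx.register("src_students", ["hasMajor"])
idx.unregister("src_students")
print(sorted(idx.rows))                 # only the registered source remains
```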
The query execution workflow takes the user's query, formulated in RDF triples, and
identifies the most effective and relevant data sources for the given query. During
relevance reasoning, queries are expanded using the inferences drawn from the ontology
reasoning service. The workflow calculates the semantic similarity between the query
and source RDF triples and identifies the relevant and effective data sources. Relevant
data sources are ranked based on the similarity score they obtained for the user query,
and the sorted list of relevant and effective data sources is returned to the query rewriting
component, which reformulates the queries for these relevant data sources.
7.2. Contributions of the Project
The first contribution of the proposed methodology is that it provides support for
semantic interoperability during the process of relevance reasoning. Semantic operators
are introduced to resolve fine-grained heterogeneities among the contents of different
data sources. The matching process checks for exact matches, lexical variants, synonyms,
subclasses, and the degree of likelihood during semantic matching. The ontology,
rule-bases, and rules indexes have been used for semantic matching and inference during
the relevance reasoning. The accuracy tests of the system showed improved precision
and recall compared with the MiniCon algorithm [1].
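The order of checks described above can be sketched in Python; the degree-of-likelihood values, the toy thesaurus, and the toy concept hierarchy are hypothetical placeholders, not the values used in the prototype:

```python
# Hypothetical degrees of likelihood for each kind of semantic match,
# checked in the order used by the matching process described above.
DEGREE = {"exact": 1.0, "lexical": 0.9, "synonym": 0.8, "subclass": 0.6}

synonyms = {"teacher": "instructor"}        # toy thesaurus
subclass = {"professor": "instructor"}      # toy concept hierarchy

def match_degree(query_term, source_term):
    """Return the degree of likelihood that two terms match semantically."""
    q, s = query_term.lower(), source_term.lower()
    if q == s:
        return DEGREE["exact"]
    if q.rstrip("s") == s.rstrip("s"):      # crude lexical-variant check
        return DEGREE["lexical"]
    if synonyms.get(q) == s or synonyms.get(s) == q:
        return DEGREE["synonym"]
    if subclass.get(q) == s or subclass.get(s) == q:
        return DEGREE["subclass"]
    return 0.0

print(match_degree("Instructor", "instructor"))   # exact -> 1.0
print(match_degree("teacher", "instructor"))      # synonym -> 0.8
print(match_degree("professor", "instructor"))    # subclass -> 0.6
```

Falling through the checks from exact match down to subclass is what lets a source remain relevant, with a lower degree of likelihood, even when its vocabulary differs from the query's.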
The second contribution of the proposed methodology is the provision for
optimization during relevance reasoning with the help of a bitmap index. Previously, the
community used the bitmap index for bulk data management in relational data
warehouses; we instead used the bitmap index to represent RDF models. The bitmap
index is used during relevance reasoning and improves the whole process by traversing
the mapped RDF data more efficiently. The time complexity tests showed that bitmap
indexing performs the relevance reasoning in comparatively less time.
7.3. Future Directions
Currently our focus is on centralized bitmap indexing in data integration systems,
where a single global ontology resides on one node and queries are reformulated over it.
As P2P DBMSs are evolving and data integration is gaining popularity in these domains,
this methodology can in future be extended to meet the requirements of P2P data
integration. Index partitions may reside on each peer, and collectively all peers will
participate in relevance reasoning during query processing.
REFERENCES
[1] Alon Halevy, Anand Rajaraman, Joann Ordille. Data Integration: The Teenage Years. Proceedings of the 32nd International Conference on VLDB, pages 9-16, September 2006.
[2] Yaser A. Bishr. Overcoming the semantic and other barriers to GIS interoperability. International Journal of Geographical Information Science, 12(4):229-314, 1998.
[3] Thomas R. Gruber and Gregory R. Olsen. An Ontology for Engineering Mathematics. Proceeding of 4th International Conference on Principles of Knowledge Representation and Reasoning (KR 1994), pages 258-269, 1994.
[4] Tom R. Gruber. A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition, pages 199-220, 1993.
[5] Natalya F. Noy. Semantic Integration: A Survey of Ontology-Based Approaches. SIGMOD Record, Vol. 33, pages 65-70, December 2004.
[6] Isabel F. Cruz and H. Xiao. The Role of Ontologies in Data Integration. Journal of Engineering Intelligent Systems: pages 245-252, December, 2005.
[7] M. Jamadhvaja, Twittie Senivgee. An Integration of Data sources with UML Class Models Based on Ontological Analysis. Pages 1-8, November 4, 2005, ACM, Bremen, Germany.
[8] S. Khan and F. Marvon, Identifying Relevant Sources in Query Reformulation. In the proceedings of the 8th International Conference on Information Integration and Web-based Applications & Services (iiWAS2006), Yogyakarta, Indonesia, December 2006.
[9] Wache, H., Vogele, T., et al., Ontology-Based Integration of Information — A Survey of Existing Approaches in The Seventeenth International Joint Conference on Artificial Intelligence, Seattle, Washington, USA, 2001.
[10] Arens, Y., Hsu, C.N., et al. Query processing in the SIMS information mediator. In readings in agents, Morgan Kaufmann Publishers Inc., pages 82-90, 1997, San Francisco USA.
[11] Mena, E., Illarramendi, A. OBSERVER: An approach for query processing in Global Information Systems based on Interoperation across Pre-existing Ontologies. IEEE, pages 19-21, 1996.
[12] F. Naumann, U.Leser, and J.C. Freytag. Quality-driven integration of heterogeneous information systems. 25th Proceeding of International Conference on VLDB, pages 447-458, Scotland, September 1999.
[13] Isabel F. Cruz, Huiyong Xiao, and Feihong Hsu. An Ontology-based Framework for Semantic Interoperability between XML Sources. In Proceedings of the 8th International Database Engineering and Applications Symposium (IDEAS), pages 217-226, July, 2004. IEEE Computer Society 2004.
[14] Nicola Guarino. Formal Ontology and Information Systems. In Proceedings of the 1st International Conference on Formal Ontologies in Information Systems (FOIS 1998), pages 3-15, 1998.
[15] Alon Y. Halevy. Answering queries using views: A survey. The VLDB Journal, pages 270-294, 2001.
[16] Alon Y. Halevy, Anand Rajaraman, Joann J. Ordille. Querying heterogeneous information sources using source descriptions. In the proceedings of the International Conference on Very Large Databases (VLDB), 1996.
[17] Rachel Pottinger and Alon Halevy. MiniCon: A scalable algorithm for answering queries using views. VLDB Journal, 2001.
[18] G. Wiederhold. Mediators in the architectures of future information systems. IEEE Computer, Pages 38-49, March 1992.
[19] J. Zhong, H. Zhu, et al. Conceptual graph matching for semantic search. In the proceedings of the 10th International conference on Conceptual Structures (ICCS), LNCS 2393, pages 92-106, Bulgaria, July 2002. Springer.
[20] A.H. Levy: Why Your Data Won’t Mix: Semantic Heterogeneity. ACM Queue 3, pages 50-58, 2005.
[21] RDF Primer. W3C Recommendation, 10th February 2004, http://www.w3c.org/RDF/
[22] Waris Ali, Sharifullah Khan, Global Query Generation over Diverse Data Sources Using Ontology. In 1st International Conference on Information and Communication Technologies, 9th June 2007, Bannu, N.W.F.P, Pakistan.
[23] Nicole Alexander, Siva Ravada. RDF Object Type and Reification in the Database. In the proceeding of 22nd Int. Conference on Data Engineering (ICDE’06). IEEE Computer Society 2006.
[24] R. Smith, T. Connolly, Data Integration Service, Book Chapter, Information management in Large Scale Enterprises. 3rd Edition.
[25] Mediator-Wrapper, http://www.objs.com/survey/wrap.htm
[26] S. Khan, F. Movan, Scalable Integration of Biomedical Sources, In the proceedings of the 8th International Conference on Information Integration and Web-based Applications & Services (iiWAS2006), Yogyakarta, Indonesia, December 2006.
[27] Jacob Kohler, Stephan Philippi, Michael Specht, Alexander Rueggd, Ontology based text indexing and querying for the semantic web? Knowledge-Based Systems 19 (2006), pages 744-754.
[28] X. Li, F. Bian, H. Zhang, C. Diot, R. Govindan, G. Iannaccone. "MIND: A Distributed Multi-Dimensional Indexing System for Network Monitoring". IEEE Infocom 2006 Barcelona April 06.
[29] XML Vocabulary Description Language 1.1 XML Schema, W3C Recommendation May 2001, http://www.w3.org/XML/Schema
[30] The DARPA Agent Markup Language Home Page. August 2000, http://www.daml.org/
[31] Web Ontology Language, W3C Recommendation, 06 September 2007. http://www.w3.org/2004/OWL/
[32] B-Tree and Bitmap Indexing. Oracle Developer Guide 10g Release 2, Part no: A969505-01, Oracle Corporation, March 2002.
[33] Jena – A semantic web framework for Java, http://jena.sourceforge.net/
[34] Kowari meta store for OWL and RDF metadata, http://www.kowari.org/
[35] Jose Kahan, Marja-Riitta, Eric Prud’Hommeaux, Ralph R. Swick. Annotate: An Open RDF Infrastructure for Shared Web Annotations, Proceedings of the WWW 10th Int. Conf., Hong Kong, May 2001.
[36] A web-based RDF browser, Longwell, http://simile.mit.edu/wiki/Longwell
[37] Oracle Semantic Technologies Network, Spatial Technology using Network Data Model, http://www.oracle.com/technology/tech/semantic_technlogies/index.html.
[38] P. Mitra. Algorithms for Answering Queries Efficiently Using Views. Technical report, Infolab, Stanford University, September 1999.
[39] F. N. Afrati, C. Li, and J. D. Ullman. Generating Efficient Plans for Queries Using Views. In ACM SIGMOD International Conference on Management of Data, Santa Barbara, CA, May 2001.
[40] E. I. Chong, S. Das, G. Edon, J. Srinivasan. An Efficient SQL-based RDF Querying Scheme. Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005.
[41] Giannis Varelas, Epimenidis Voutsakis, Paraskevi Raftopoulou, “Semantic Similarity Methods in WordNet and their Application to Information Retrieval on the Web”, 7th ACM international workshop on Web information and data management November 5, 2005.