TOPIC
By
XYZ
Supervisor
Dr
A thesis submitted in partial fulfillment of
The requirements for the degree of
Masters in Computer Science
In
Department of Computer Science
Pakistan
(July 2018)
APPROVAL
It is certified that the contents and form of the thesis entitled “” submitted have been found satisfactory for the requirements of the degree.
Advisor: __________________
Committee Member: _________________
Committee Member: _________________
Committee Member: _________________
IN THE NAME OF ALMIGHTY ALLAH
THE MOST BENEFICENT AND THE MOST MERCIFUL
TO MY PARENTS,
BROTHER AND SISTERS
CERTIFICATE OF ORIGINALITY
I hereby declare that this submission is my own work and to the best of my knowledge it
contains no materials previously published or written by another person, nor material which to a
substantial extent has been accepted for the award of any degree or diploma at BZU or at any
other educational institute, except where due acknowledgement has been made in the thesis. Any
contribution made to the research by others, with whom I have worked at BZU or elsewhere, is
explicitly acknowledged in the thesis.
I also declare that the intellectual content of this thesis is the product of my own work, except for
the assistance from others in the project’s design and conception or in style, presentation and
linguistics which has been acknowledged.
Author Name:
Signature: ______________
ACKNOWLEDGEMENTS
First of all, I am extremely thankful to Almighty Allah for giving me the courage and strength to complete this challenging task and to compete with the international research community. I am also grateful to my family, especially my parents, who have supported and encouraged me through their prayers, which have always been with me.
I am highly thankful to my supervisor for his valuable suggestions and continuous guidance throughout my research work. His foresight and critical analysis taught me a great deal about valuable research, which will help me in my practical life.
I would like to offer my gratitude to all the members of the research group and my close colleagues who have encouraged me throughout my research work, especially Mr Maruf Pasha.
TABLE OF CONTENTS
List of Figures
List of Tables
List of Abbreviations
ABSTRACT
CHAPTER 1
INTRODUCTION
1.1. Motivation
1.2. Problem Definition
1.3. Objective and Goals of Research
1.4. Outlines of Thesis
CHAPTER 2
BACKGROUND STUDIES
2.1. Data Integration
2.2. Issues in Data Integration
2.3. Approaches to Data Integration
2.4. Query Processing in Data Integration
2.5. Ontology
2.6. Indexing
CHAPTER 3
LITERATURE SURVEY
3.1. Query Reformulation
3.2. State of the Art Techniques
CHAPTER 4
PROPOSED ARCHITECTURE
4.1. Proposed Architecture for the Relevance Reasoning
4.2. Semantic Matching & Source Ranking of RDF Triples
4.3. Proposed Semantic Matching Methodology
4.4. Explanation of Proposed Methodology using a Case Study
CHAPTER 5
IMPLEMENTATION
5.1. RDF Data / Ontologies in Oracle Database
5.2. Setting up the Stage for Implementation
5.3. Implementation of the Proposed Architecture for Relevance Reasoning
CHAPTER 6
RESULTS AND EVALUATION
6.1. System Specification
6.2. Evaluation Criteria
6.3. Data Specification
6.4. Test Queries
6.5. Experiments for Response Time of Query Execution
6.6. Experiments for System Accuracy
CHAPTER 7
CONCLUSION AND FUTURE DIRECTIONS
7.1. Discussion
7.2. Main Contribution of the Project
7.3. Future Direction
REFERENCES
LIST OF FIGURES
Figure 1: Data Warehousing Architecture for Data Integration
Figure 2: Mediator Wrapper Architecture for Data Integration
Figure 3: RDF Triple as Directed Graph
Figure 4: Structure of a Bitmap Index
Figure 5: Proposed Architecture for Relevance Reasoning in Data Integration Systems
Figure 6: Sequence Diagram for Ontology Management Workflow
Figure 7: Pseudo-code for RDF Triple Registration of Global Ontology
Figure 8: InverseOf SameAs Rule Inserted in the Rule-base
Figure 9: TransitiveOf SameAs Rule Inserted in the Rule-base
Figure 10: Pseudo-code for RDF Triple Creation of Local Ontology
Figure 11: Pseudo-code for Bitmap Segment Creation
Figure 12: Pseudo-code for Bitmap Synchronization
Figure 13: Sequence Diagram for Source Registration Workflow
Figure 14: Sequence Diagram for Relevance Reasoning Workflow
Figure 15: Pseudo-code for Query Expansion in Relevance Reasoning Workflow
Figure 16: Pseudo-code for Source Selection in Relevance Reasoning Workflow
Figure 17: Snapshot of the Global Ontology
Figure 18: Concept & Relationship Hierarchies Managed using Semantic Operators over Global Ontology
Figure 19: Database Schema to Store Ontology in Oracle NDM
Figure 20: Package Diagram of the Proposed Architecture for Relevance Reasoning
Figure 21: Time Complexity of System (Query with 3 Triples)
Figure 22: Time Complexity of System (Query with 6 Triples)
Figure 23: Time Complexity of System (Query with 9 Triples)
Figure 24: Performance Gain of the System with respect to Direct Ontology Traversal
Figure 25: Precision vs Recall Comparison of the Proposed Methodology with the MiniCon Algorithm
LIST OF TABLES
Table 1: Relevance Levels and Scoring Strategy
Table 2: RDF Triples of the Global Ontology
Table 3: Structure of Bitmap Index
Table 4: RDF Triples of the Data Sources
Table 5: Structure of Bitmap Index after Sources are Registered
Table 6: Buckets Created for the RDF Triples
Table 7: Inferred RDF Triples for a User’s Query Triple
Table 8: Semantic Similarity Calculation of a Data Source for a User Query Triple
Table 9: Semantic Similarity Calculation of a Data Source for a User Query
LIST OF ABBREVIATIONS
XML Extensible Markup Language
WWW World Wide Web
DAML DARPA Agent Markup Language
OWL Web Ontology Language
API Application Programming Interface
DIS Data Integration Systems
NDM Network Data Model
RDF Resource Description Framework
W3C World Wide Web Consortium
URL Uniform Resource Locator
ICT Information and Communication Technologies
AI Artificial Intelligence
UMLS Unified Medical Language System
IM Information Manifold
GUID Global Unique Identifier
LUID Local Unique Identifier
SDS Source Description Storage
ABSTRACT
Online data sources are autonomous, heterogeneous and geographically distributed. Data sources can join and leave a data integration system arbitrarily, and some sources may not contribute significantly to a user query because they are not relevant to it. Executing queries against all the available data sources consumes resources unreasonably and makes these queries expensive.
Source selection is an approach to resolving this issue. The existing relevance reasoning techniques for source selection take significant time to traverse the source descriptions. Consequently, query response time degrades as the number of available sources grows. Moreover, a simple matching process cannot resolve the fine-grained semantic heterogeneities of the data, and the semantic heterogeneity of data sources makes relevance reasoning complex. These issues degrade the performance of data integration systems.
In this research, we have proposed an ontology-driven relevance reasoning architecture that identifies relevant data sources for a user query before its execution. The proposed methodology aligns source descriptions (i.e., local ontologies) with the domain ontology through a bitmap index. Instead of traversing the local ontologies, the methodology uses the bitmap index to perform relevance reasoning, thereby improving query response. Semantic matching has been employed in relevance reasoning to provide semantic interoperability. Semantic operators, such as exactMatch, sameAs, equivalentOf, subClassOf, and disjointFrom, have been introduced to resolve fine-grained semantic heterogeneities among data sources. Quantitative scores are assigned to the operators, and data sources are ranked by the similarity scores they obtain.
A prototype system has been designed and implemented to validate the methodology. The evaluation criteria are (a) query response time and (b) accuracy of relevant source selection. The prototype system has been compared with existing systems for evaluation. Query response time and accuracy of source selection, measured in terms of precision and recall, have been improved by the incorporation of the bitmap index and the ontology, respectively.
CHAPTER 1
INTRODUCTION
This chapter introduces the research work undertaken in this thesis. It includes the motivation for and the definition of the problem. Moreover, the objectives and goals of the research are also discussed.
1.1. Motivation
The exponential growth in data sources on the Internet is due to advancements in information and communication technologies (ICT). Some data sources contain interrelated data that could answer a user query. Retrieving data from these interrelated data sources is a non-trivial task due to their properties, i.e., autonomy, heterogeneity and geographical distribution [1, 8, 11, 23]. The sources can be heterogeneous in terms of syntax, schema, or semantics. The task of a data integration system is to enable the interoperation of autonomous and distributed data sources for knowledge discovery through a centralized access point. It provides a uniform query interface that gives a user transparent access for querying data sources. However, the properties discussed above make integration among the sources a pervasive challenge and a crucial task [1, 8, 23].
A variety of approaches to data integration exist. These approaches can be broadly classified into two major categories: (a) data warehousing and (b) mediation [1, 28]. In data warehousing, the required data is extracted from the sources and stored in a centralized repository after integration, while in mediation, data is gathered and integrated when a user query is submitted. Query execution is efficient and response time is predictable in warehousing, but results can be stale. On the contrary, query execution is slower in mediation, but results are up to date [1, 21, 28].
The growth of online data sources requires a scalable data integration system because the sources are unpredictable due to their autonomy. In other words, data sources can join and leave the system arbitrarily. Thus, checking the availability of a data source before executing a query is necessary. Moreover, not all data sources may have the required information. Executing a query on all data sources is an expensive solution, because an available source may not contribute any significant information to the query result [8, 20, 23]. To execute queries efficiently in these systems, we need to identify relevant and effective data sources that are available at the time of execution. This research work focuses on relevance reasoning for identifying relevant and effective data sources in a scalable data integration system.
1.2. Problem Definition
Identifying relevant sources in a scalable data integration system is difficult due to semantic heterogeneity and lack of performance. We highlight these problems in depth in the following paragraphs.
Semantic Heterogeneity: Data sources are developed by independent organizations, so there may be semantic differences between their schemas [20]. In different data sources, the same concept may be represented with different names, such as instructor, teacher or lecturer. Similarly, different concepts in different data sources may be represented by the same name, such as bank, which can be a river bank or a financial institution.
Performance in Query Response Time: Some data sources may not contribute significantly to a user query because they are not relevant. Executing a query on all available data sources, without any estimate of their relevance to the query, degrades the query’s performance. This leads to unreasonable waste of the resources of the data integration system.
1.3. Objective and Goals of Research
The goal of this research is to provide a mechanism for relevance reasoning in a scalable data integration system. In particular, our objective is to advance relevance reasoning in the following directions.
Provision of Semantic Interoperability in Relevance Reasoning: Ontology, initially developed by the artificial intelligence community for knowledge sharing and reuse, is a formal, explicit specification of a shared conceptualization [5]. Ontology is widely used for representing domain knowledge and can play a vital role in reconciling semantic heterogeneities due to its representational and expressive capabilities [3, 4]. In this research, we exploit the capabilities of the domain ontology to provide semantic interoperability and to handle source heterogeneities during relevance reasoning.
Optimization of the Relevance Reasoning Mechanism: Indexing structures are used in databases to access data efficiently [27, 28]. We propose semantic indexing using a bitmap technique to represent the metadata of data sources. A user query is evaluated against the bitmap index to identify relevant data sources. The index performs relevance reasoning in an improved manner, thereby enhancing query response time.
1.4. Outlines of Thesis
The rest of the document is organized as follows. Chapter 2 describes a data integration system and its various components; RDF is also explained as a language for developing ontologies and for storing source descriptions and semantic mappings. Chapter 3 discusses various algorithms for relevance reasoning and provides their critical analysis. Chapter 4 presents the proposed system architecture and semantic matching process, along with the proposed methodology for relevance reasoning. Chapter 5 gives a complete overview of the implementation details. Chapter 6 presents the experimentation and comparative analysis carried out to validate the proposed architecture, and discusses the conducted experiments. Chapter 7 concludes the thesis and outlines future research directions.
CHAPTER 2
BACKGROUND STUDIES
This chapter provides the background literature needed to understand the context of this research. Data integration and semantic heterogeneity are discussed, and the details of ontology, its design methodology, and indexing are also included.
2.1. Data Integration
Data sources on the Internet are growing exponentially in size and number over time. These data sources contain information about different topics such as the stock market, product information, real estate, and entertainment. The data from these sources can be used to answer complex user queries that go beyond traditional searches. Advancements in information and communication technology have enabled users to access a wide array of related data sources and to integrate the results into useful information that might not be stored physically in a single place [1, 8, 12, 24].
Data integration enables the interoperability of data sources for knowledge discovery through a centralized access point, and provides a uniform query interface that gives the user the illusion of querying a homogeneous system [2, 15, 19, 31]. In data integration, a user is provided with a unified interface for posing queries, which is based on a schema typically referred to as the global schema or mediated schema. Depending on the approach used to develop the data integration system, the user is provided with results obtained from the underlying data sources either from a centrally materialized repository or in real time.
2.2. Issues in Data Integration
Data sources in data integration are maintained by different organizations, geographically distributed, and managed autonomously. This scenario creates a variety of barriers to integrating data from the participating sources. The most common issues include (a) autonomy and (b) semantic heterogeneity. To achieve scalable data integration, these issues need to be sorted out.
2.2.1. Autonomy: In data integration, autonomy indicates the ability of data sources to control their data and processing capabilities. The data sources retain their autonomy even after becoming part of the integration system [24, 31]. This autonomy gives rise to the following issues:
- The source data administrators might not be interested in, or may not have the resources for, helping the integrators understand how their site's schema relates to the schemas of the other sites being integrated.
- The source data administrators might change their site's schema without forewarning the integrators, which can lead the integration software to make invalid assumptions about the data source.
- The data source administrators might choose a schema that is very difficult to integrate with the other schemas in the integrated system.
2.2.2. Semantic Heterogeneity: In data integration, heterogeneities arise from different programming and data models as well as from different conceptualizations of a real-world object. Among these heterogeneities is semantic heterogeneity [20]. A variety of semantic heterogeneities can be found across data sources; a few of them are:
2.2.2.1. Synonym: The same concept may be represented with different names in different data sources, e.g., Course and Subject.
2.2.2.2. Homonym: Different concepts in different data sources may be represented by the same name, e.g., bear can be an animal or a verb meaning to tolerate.
2.2.2.3. Degree of likelihood: Two concepts can be relevant to each other to a certain degree of likelihood. This does not mean equality of concepts, as with synonyms, but rather relatedness. For example, in <:Teacher :isTeaching :Course> and <:TeachingAssistant :isAssisting :Course>, teaching assistant and teacher are not the same concept but are relevant to each other with a certain degree of likelihood.
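The three kinds of heterogeneity above can be made concrete with a small sketch. The mapping tables, concept names, and the 0.6 likelihood score below are illustrative assumptions for this example only, not part of the thesis methodology.

```python
# Synonyms: different names for the same concept map to one canonical term.
SYNONYMS = {"Subject": "Course", "Instructor": "Teacher", "Lecturer": "Teacher"}

# Homonyms: one name, several senses; context is needed to choose one.
HOMONYMS = {"bank": ["river_bank", "financial_institution"]}

# Degree of likelihood: related-but-not-equal concepts with a score in (0, 1).
RELATED = {("Teacher", "TeachingAssistant"): 0.6}

def canonical(term):
    """Map a source term to its canonical concept (synonym resolution)."""
    return SYNONYMS.get(term, term)

def relatedness(a, b):
    """1.0 for the same canonical concept, otherwise the degree of likelihood."""
    a, b = canonical(a), canonical(b)
    if a == b:
        return 1.0
    return RELATED.get((a, b), RELATED.get((b, a), 0.0))
```

Under these assumptions, `relatedness("Instructor", "Lecturer")` yields 1.0 (synonyms), while `relatedness("Teacher", "TeachingAssistant")` yields the partial score 0.6.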
2.3. Approaches to Data Integration
A variety of approaches to data integration exist. These approaches can be broadly classified into two major categories: (a) data warehousing and (b) mediation.
2.3.1. Warehouse: In data warehousing, the required data is extracted from the sources and stored in a centralized repository after integration [19, 24]. Users pose queries against the data model of the warehouse. This approach is also known as the eager or materialized-view approach to data integration. Query execution is efficient and response time is predictable in this approach, but results are often stale [1]. Figure 1 shows the data warehousing architecture [24].
2.3.2. Mediation: In the mediation approach, a user is given a unified schema, containing virtual relations, for posing queries. Data is not loaded into a central repository in advance; rather, queries are executed at run time [1, 19, 20, 24]. In order to answer a user query using the information sources, metadata is needed that describes the semantic relationships between the elements of the mediated schema and the schemas of the underlying data sources. This metadata is known as a source description. This approach is also known as the lazy or virtual-view approach to data integration. Query execution is slower in mediation, but results are up to date [1, 21, 24]. Figure 2 depicts the mediation-based architecture for data integration [24].
Figure 1: Data Warehousing Architecture for Data Integration
2.4. Query Processing in Data Integration
The main objective of data integration is to facilitate access to a set of autonomous, heterogeneous and distributed data sources. The ability to efficiently and correctly execute a query over the integrated data lies at the heart of data integration. The main steps in processing a query in data integration are (1) query reformulation and (2) query planning and execution.
2.4.1. Query Reformulation: Query reformulation is the first step in query processing, where a user query written in terms of the mediated schema is reformulated, using information about the sources, into queries that refer directly to the schemas of the underlying data sources [1, 8, 10, 11, 19, 24]. Query reformulation is further divided into two steps: (a) source identification and (b) query rewriting.
2.4.1.1. Source identification: Before executing a user query, relevant and effective sources should be clearly identified to optimize query execution. Relevance reasoning is the process of identifying relevant sources and pruning irrelevant and redundant ones. The main focus of our research is to propose an algorithm that speeds up relevance reasoning.
Figure 2: Mediator Wrapper Architecture for Data Integration
2.4.1.2. Query rewriting: Once the relevant sources have been identified, query rewriting is performed, and source-specific queries are formulated only for those sources that have been found relevant and can contribute some result to the user’s query.
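The two reformulation steps can be sketched with a simplified model in which each source description is just the set of mediated-schema relations the source can answer. The source names, relations, and set-based matching below are hypothetical simplifications, not the thesis's actual reformulation algorithm.

```python
# Simplified source descriptions: source name -> relations it can answer.
SOURCE_DESCRIPTIONS = {
    "src_university": {"Teacher", "Course", "Student"},
    "src_library":    {"Book", "Author"},
    "src_registrar":  {"Student", "Course"},
}

def identify_sources(query_relations, descriptions):
    """Step (a): keep only sources whose description overlaps the query."""
    return {name for name, rels in descriptions.items()
            if rels & query_relations}

def rewrite(query_relations, descriptions):
    """Step (b): build a per-source query over the relations it covers."""
    plans = {}
    for name in identify_sources(query_relations, descriptions):
        plans[name] = descriptions[name] & query_relations
    return plans

# A query over Teacher and Course: the library source is pruned,
# and source-specific queries are produced for the remaining two.
plans = rewrite({"Teacher", "Course"}, SOURCE_DESCRIPTIONS)
```

In this toy setting, `src_library` contributes nothing to the query and is pruned before execution, which is exactly the optimization source identification provides.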
2.4.2. Query Planning and Execution: Query reformulation provides some optimization by pruning irrelevant and overlapping sources to avoid redundant computation. The reformulated queries are evaluated using different strategies, producing multiple execution plans during optimization [11, 12]. The query execution engine runs these queries using the best and cheapest execution plan and deals with the limitations and capabilities of the data sources [28]. During execution, an important goal is to minimize the time to return the first answers to the query, rather than minimizing the total amount of work needed to execute the whole query [21, 24].
2.5. Ontology
Ontology is defined as an explicit and formal specification of a shared
conceptualization [3, 4, 15]. In this definition, the term conceptualization refers to an
abstract model of some domain knowledge that identifies relevant concepts of the
domain. The term shared indicates that ontology captures consensual knowledge that is
accepted by a group of people and systems. The term explicit means that concepts and the
constraints on these concepts are explicitly defined. Finally, the term formal means that
the ontology should be machine understandable [15]. Ontology was initially developed
by the Artificial Intelligence (AI) community to facilitate knowledge sharing and reuse.
Ontology carries the semantics of a particular domain and is hence used for representing domain knowledge. Ontology is widely used in data standardization and conceptualization. Ontologies have proven to be an essential element in many applications, including agent systems, knowledge management systems, and e-commerce systems. They can also be used to generate natural-language-like queries, integrate information intelligently, and provide semantics-based access to the Internet [36]. An ontology can be a taxonomy (e.g., Yahoo categories), a domain-specific standard terminology (e.g., UMLS and the Gene Ontology), or an online lexical database (e.g., WordNet).
Ontology consists of concepts, properties, and individuals. A concept is a thing of significance in the real world. Concepts may be organized into a super-class and subclass hierarchy, also known as a taxonomy, where subclasses specialize their super-classes. Concepts in an ontology can be synonymous or disjoint. Properties represent relationships between two concepts. Properties may have a domain and a specified range, and may be inverse, functional, transitive, or symmetric. Individuals represent objects in the domain. An ontology needs a reasoner, which can check whether all of the statements and definitions in the ontology are mutually consistent and can also recognize which concepts fit under which definitions. The reasoner helps to maintain the hierarchy correctly.
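A minimal sketch of these building blocks follows: concepts in a subclass taxonomy, a property with a domain and a range, individuals, and a reasoner-style subsumption check. The class and property names are illustrative assumptions, not drawn from the thesis's ontology.

```python
SUBCLASS_OF = {            # child concept -> parent concept (taxonomy)
    "Teacher": "Person",
    "TeachingAssistant": "Person",
    "Person": "Thing",
    "Course": "Thing",
}

PROPERTIES = {             # property -> (domain concept, range concept)
    "isTeaching": ("Teacher", "Course"),
}

INDIVIDUALS = {"alice": "Teacher", "cs101": "Course"}

def is_a(concept, ancestor):
    """Walk the taxonomy upward, as a reasoner's subsumption check would."""
    while concept is not None:
        if concept == ancestor:
            return True
        concept = SUBCLASS_OF.get(concept)
    return False

def statement_is_consistent(subj, prop, obj):
    """Check an assertion against the property's declared domain and range."""
    dom, rng = PROPERTIES[prop]
    return is_a(INDIVIDUALS[subj], dom) and is_a(INDIVIDUALS[obj], rng)
```

The `statement_is_consistent` check mirrors, in a very small way, what a reasoner does when it verifies that asserted statements respect the ontology's definitions.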
2.5.1. Ontology Modeling Languages: To develop ontology-driven applications, a language is needed to facilitate the semantic representation of the information these applications require. A number of research groups identified the need for a more powerful ontology modeling language, which led to joint initiatives for building such languages. As a result, a number of ontology modeling languages are available and in use today [36]. The most common include XML Schema [35], DAML+OIL [37], RDF and RDFS [25], and OWL [38]. Among these languages, we are most interested in RDF and RDFS for their role in data integration and the semantic web [4, 6, 25, 26].
2.5.2. RDF and RDFS: The Resource Description Framework (RDF) is a standard, developed by the World Wide Web Consortium (W3C), for representing information about resources. RDF provides interoperability across resources due to its simple structure. RDF Schema (RDFS) is a language for describing vocabularies of RDF data in terms of primitives such as Class, Property, domain, and range. The machine-understandable format of RDF facilitates the automated processing of web resources [5, 6, 26]. In RDF, a pair of resources (nodes) connected by a property (edge) forms a statement: (resource, property, value), often called an RDF triple. A set of triples is known as a model or graph. The components of a triple are a subject, a predicate (or property), and an object. Each triple represents a complete and unique fact for a specific domain. It can be modeled as a link in a directed graph, as shown in Figure 3: the subject is the start node of the link and the object is the end node, with the link always pointing towards the object. A detailed description of the RDF language can be found in [25].
Some of the important concepts of RDF are discussed below:
- A URI is a more generic form of a Uniform Resource Locator (URL). It allows us to identify a web resource without a specific network address (http://www.niit.edu.pk/delsa#Instructor).
- A blank node is used when either the subject or the object of a triple is unknown, or when the relationship between the subject and object is n-ary.
Figure 3: RDF Triple as Directed Graph
- A literal is a string used to represent names, dates, and numbers.
- A typed literal is a string combined with its data type (e.g., “Smith”^^http://www.w3.org/2001/XMLSchema#string).
- A container is a resource that is used to describe a group of things; participants in a container are members of the group. Blank nodes are usually used to represent containers.
- Reification allows triples to be attached to other triples as properties. One of its major issues is representational complexity, which is why it is sometimes termed “The Big Ugly”.
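The triple model described above can be sketched directly as (subject, property, object) tuples forming a directed graph, as in Figure 3. The namespace and triples below are illustrative assumptions only.

```python
NS = "http://example.org/delsa#"   # hypothetical namespace

# Each tuple is one RDF statement; together they form a graph.
triples = [
    (NS + "Instructor", NS + "isTeaching", NS + "Course"),
    (NS + "Course", NS + "hasTitle", '"Databases"'),   # object is a literal
]

def outgoing_edges(graph, subject):
    """Follow links from a start node toward its object nodes."""
    return [(p, o) for s, p, o in graph if s == subject]
```

Starting from the `Instructor` node, the only outgoing edge leads via `isTeaching` to the `Course` node, matching the directed-graph reading of a triple.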
A variety of RDF storage systems and browsers are available, such as Jena [33], Kowari [34], Sesame [35], Longwell [36], and the Oracle RDF Data Model [37, 40]. We have used the Oracle RDF Data Model for managing the global ontology and the source descriptions because it is efficient in terms of storage and does not suffer from slow performance. It provides a basic infrastructure for effectively managing RDF data in databases, and RDF data can be readily integrated, managed and analyzed together with other enterprise data. A comparative analysis of RDF storage [26] showed that the Oracle RDF Data Model outperforms the other existing RDF storage systems.
2.6. Indexing
Databases spend much of their time finding things, so lookups need to be as fast as possible. Indexes provide the basis for both rapid random lookups and efficient ordered access to data. An index is associated with a search key, that is, one or more attributes of a relation for which the index provides fast access. The disk space required to store an index is typically less than that required for the table itself. Indexes can be primary or secondary. A variety of indexing techniques are used in modern DBMSs, e.g., hash-based indexing, cluster indexing, tree-structured indexing, and bitmap indexing. The most efficient and compact indexing techniques for dealing with bulk data [26, 28] are (a) the B+-tree index and (b) the bitmap index. In this thesis we use bitmap indexes due to their compact internal representation for bulk data.
2.6.1. Bitmap Index: Bitmap indexing is a specialized technique geared towards easy querying based on multiple search keys. In a bitmap index, an attribute is stratified into a relatively small number of possible values and then queried based on that stratification. Internally, bitmap index entries are bitmap vectors of 0s and 1s. Figure 4 depicts the structure of a bitmap index. Bitmap indexing benefits applications where ad-hoc queries are executed on large amounts of data with a low level of concurrent transactions [26, 28]. The purpose of using a bitmap index in our approach is to provide pointers to RDF triples for efficient searching. Normal indexing could also achieve this functionality by storing an RDF triple with each index entry, but it consumes more space than bitmaps. In our bitmap index, a single bitmap vector represents the status of a whole source. Each bit in a bitmap vector corresponds to an RDF triple; if the bit is set, the source contains the corresponding RDF triple. A mapping function converts a bit position to an actual RDF triple. The bitmap index thus provides the same functionality as a regular index even though it uses a different internal representation. The major benefits of bitmap indexing include:
2.6.1.1. Compact Storage and Reduced Query Response Time: Fully indexing an RDF repository with traditional indexes can be prohibitively expensive in terms of space, because an index can be several times larger than the actual RDF data. Bitmap indexes are only a fraction of the size of the data being indexed. This compact and concise representation saves space and reduces computation while searching for an RDF triple.
2.2.2.2. Efficient Parallel Data Manipulation and Loading: In our
methodology, sources advertise their capabilities and contents in the form of RDF triples
to the global ontology. A single source may contain bulks of RDF triples. Bitmap indexes
are very efficient in the bulk processing of data manipulation statements and data loading.
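The mechanics described above can be sketched in a few lines of Python. This is a minimal illustration, not the thesis implementation: the class name, method names, and data structures are all illustrative. It shows the mapping function between bit positions and RDF triples, one bit vector (segment) per source, and a membership lookup over the segments.

```python
# A minimal sketch (not the thesis implementation) of a bitmap index over
# RDF triples. Each source holds one bit vector; bit i is 1 iff the source
# contains the triple that the mapping function assigns to position i.

class TripleBitmapIndex:
    def __init__(self):
        self.positions = {}   # triple -> bit position (the mapping function)
        self.triples = []     # bit position -> triple (inverse mapping)
        self.segments = {}    # source name -> list of bits

    def register_triple(self, triple):
        """Reserve a bit position for a new global-ontology triple."""
        if triple not in self.positions:
            self.positions[triple] = len(self.triples)
            self.triples.append(triple)
            for bits in self.segments.values():
                bits.append(0)            # extend every existing segment

    def add_source(self, source):
        self.segments[source] = [0] * len(self.triples)

    def advertise(self, source, triple):
        """Set the bit: `source` contains `triple`."""
        self.segments[source][self.positions[triple]] = 1

    def sources_containing(self, triple):
        pos = self.positions[triple]
        return [s for s, bits in self.segments.items() if bits[pos]]

idx = TripleBitmapIndex()
t = ("nust:Ali", "rdf:type", "nust:Instructor")
idx.register_triple(t)
idx.add_source("S1")
idx.add_source("S2")
idx.advertise("S1", t)
# idx.sources_containing(t) -> ["S1"]
```

Note how a lookup touches only one bit per source, which is what makes the representation both compact and fast to traverse.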
In a nutshell, we have discussed different data integration approaches that are widely
used nowadays. Ontology and its modeling languages have been highlighted because
they can help data integration systems cope with the semantic heterogeneities that exist
in the domain of discourse. Finally, indexing has been discussed in general as a means to
speed up the querying mechanism; in particular, bitmap indexing has been explained as a
way to traverse Semantic Web metadata efficiently.
[Figure 4 here: the structure of a bitmap index, in which each search key (A, X, Y, G, T, U, V, Z) is associated with a bitmap vector of 0s and 1s.]
Figure 4: Structure of a bitmap index
CHAPTER 3
LITERATURE SURVEY
Relevant data source selection during query reformulation in data integration systems
has attracted significant attention in the literature over the last few decades [5, 6, 7, 8, 11,
12, 19, 20, 21, 24]. This chapter starts with a discussion and evaluation of the state-of-the-art
algorithms used in data integration systems for the identification of relevant data
sources during query reformulation.
3.1 Query Reformulation
In query reformulation, a user's query, originally written in terms of a mediated
schema, needs to be reformulated (rewritten) into queries that refer directly to the
schemas of the underlying data sources [10, 11, 19, 24]. In the literature, query
reformulation is further sub-divided into two steps: (a) relevant source selection and (b)
query rewriting.
3.1.1. Relevant source identification: Before executing user queries, relevant and
effective sources should be clearly identified, because not all available data sources may
contribute significantly. Relevance reasoning is the process of identifying relevant
sources and pruning irrelevant and redundant data sources.
3.1.2. Query rewriting: Once the relevant sources have been identified, query
rewriting is performed and source-specific queries are generated only for those sources
that have been found relevant and can contribute to the answer of the user's query.
3.2 State-of-the-Art Techniques
The main focus of this research is to propose an algorithm that can speed up the
process of relevance reasoning. The following subsections elaborate the state-of-the-art
algorithms used in different data integration systems for relevant source selection during
query reformulation.
3.2.1. The Bucket Algorithm: This algorithm has been used in the Information
Manifold (IM) [1, 20], a system for browsing and querying multiple networked
information sources. IM provides a mechanism to describe the contents and capabilities
of data sources in source descriptions (which in our architecture are called source
models). The Bucket algorithm uses the source descriptions to create query plans that can
access several information sources to answer a query. The algorithm prunes irrelevant
data sources using the source descriptions and reformulates source-specific queries only
for the relevant data sources. In order to describe and reason about the contents of data
sources, IM uses the relational model augmented with certain object-oriented features.
Technically, the algorithm constructs a number of buckets and checks a user query
against each bucket to identify the relevant data sources. Once the relevant buckets have
been identified, source-specific conjunctive queries are rewritten for each source.
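The bucket construction step can be illustrated with a deliberately simplified sketch. This is not the full Bucket algorithm: source descriptions are reduced to sets of predicate names, and the exponential containment test that validates each candidate combination is omitted; all names here are illustrative.

```python
from itertools import product

# An illustrative simplification of the Bucket algorithm: each query subgoal
# gets a bucket of sources whose descriptions mention that predicate; the
# candidate rewritings are the Cartesian product of the buckets. The real
# algorithm additionally runs a (worst-case exponential) containment test
# on every combination, which is omitted here.

def bucket_rewritings(query_subgoals, source_descriptions):
    buckets = []
    for predicate in query_subgoals:
        bucket = [src for src, preds in source_descriptions.items()
                  if predicate in preds]
        if not bucket:          # some subgoal cannot be answered at all
            return []
        buckets.append(bucket)
    return list(product(*buckets))

sources = {
    "S1": {"teaches", "enrolled"},
    "S2": {"teaches"},
    "S3": {"enrolled"},
}
plans = bucket_rewritings(["teaches", "enrolled"], sources)
# Each candidate plan pairs one source per subgoal, e.g. ("S1", "S3").
```

Even in this toy setting, the Cartesian product makes the number of candidate plans grow multiplicatively with the bucket sizes, which is exactly the inefficiency the later algorithms attack.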
3.2.2. The Inverse-Rules Algorithm: InfoMaster (http://infomaster.stanford.edu/)
[19] is an information integration system that provides integrated access to multiple,
distributed, and heterogeneous information sources on the Internet. InfoMaster creates a
virtual data warehouse. The algorithm behind InfoMaster is the Inverse-Rules algorithm,
which rewrites the definitions of data sources by constructing a set of rules. A set of rules
is formulated to define the contents and capabilities of each data source, and
heterogeneities among the data sources are dealt with during rule construction. These
rules guide the algorithm in computing records from the data sources using the source
definitions. The algorithm dynamically determines an efficient way to answer the user's
query using as few sources as necessary. In simple words, the algorithm does not
reformulate the query; rather, it reformulates the source definitions so that the original
query can be answered directly over the reformulated rules.
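The core inversion step can be sketched as follows. This is a toy illustration of the general idea, not InfoMaster's code: a view definition is represented as a head plus a list of body atoms over global relations, and inversion emits one rule per body atom, introducing a Skolem term for every variable that does not appear in the view head. The representation and naming are assumptions made for the example.

```python
# A toy sketch of rule inversion (not InfoMaster's actual implementation).
# A source view is head(vars) :- body atoms over global relations; inversion
# produces one rule per body atom, mapping source tuples back to the global
# relation, with Skolem terms for variables absent from the head.

def invert(view_name, head_vars, body):
    """body: list of (relation, vars). Returns inverse rules as
    (head_atom, body_atom) pairs, read as head_atom :- body_atom."""
    rules = []
    for relation, arg_vars in body:
        new_args = []
        for v in arg_vars:
            if v in head_vars:
                new_args.append(v)
            else:
                # existential variable: replace by a Skolem term over head vars
                new_args.append(f"f_{v}({','.join(head_vars)})")
        rules.append(((relation, tuple(new_args)), (view_name, tuple(head_vars))))
    return rules

# View S1(x) :- teaches(x, y) inverts to teaches(x, f_y(x)) :- S1(x).
rules = invert("S1", ["x"], [("teaches", ["x", "y"])])
```

The inverted rules are computed once per source and reused for every query, which is the query-independence property discussed in the critical analysis below.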
3.2.3. The MiniCon Algorithm: The MiniCon algorithm [19, 21] improves on the
Bucket algorithm; its main design focus is the performance of query reformulation
algorithms. MiniCon finds the maximally contained rewriting of a conjunctive query
using a set of conjunctive views. The Bucket algorithm completes in two steps:
computing the buckets, and then reformulating the source-specific queries using the
buckets of the relevant data sources. The main complexities of the Bucket algorithm are:
(a) even when the number of sound data sources is small, it may generate a large number
of candidate solutions and then reject them; (b) the exponential conjunctive-query
containment test used to validate each candidate solution. MiniCon instead pays attention
to the interaction of the variables in the user query and in the source definitions, pruning
the sources that would later be rejected by the containment test. This timely detection of
irrelevant data sources improves MiniCon's performance, because far fewer combinations
need to be checked.
3.2.4. The Shared-Variable-Bucket Algorithm: The design goal of this algorithm
[38] is to remedy the deficiencies of the Bucket algorithm and develop an efficient
algorithm for query reformulation. The key idea underlying the algorithm is to examine
the shared variables and reduce the bucket contents, thereby reducing the number of view
combinations. This reduction ultimately optimizes the second phase of the algorithm.
3.2.5. The CoreCover Algorithm: In this algorithm [39], views are materialized from
source relations. Its main aim is to find rewritings that are guaranteed to produce an
optimal physical plan. The emphasis leans mostly towards query optimization; therefore,
different cost models are also considered. The algorithm seeks an equivalent rewriting
rather than a contained rewriting.
3.3. Critical Analysis
The CoreCover algorithm [39] differs from the other query reformulation algorithms
in the following respects. Firstly, it finds an equivalent rewriting, whereas all the other
algorithms find a maximally-contained source-specific rewriting of the query. Secondly,
it adopts the closed-world assumption to find an equivalent rewriting, whereas all the
other algorithms adopt the open-world assumption. Thirdly, its reformulation stage of
query processing has to guarantee an optimal plan for the query. The Bucket, MiniCon,
and Shared-Variable-Bucket algorithms construct buckets and then take the Cartesian
product of the buckets to produce source-specific rewritings. In the Bucket algorithm, the
constructed buckets are large, which causes many combinations to be computed and
tested in the second phase; the MiniCon and Shared-Variable-Bucket algorithms avoid
this deficiency. The MiniCon algorithm has been shown to outperform both the Bucket
and the Inverse-Rules algorithms [21]. The Inverse-Rules algorithm is query independent:
the rules are computed once and applied to all queries, and they are easily extendable to
handle functional dependencies [19]. However, the algorithm ignores the predicates
during rewriting and requires an additional phase, added to the algorithm, to remove the
irrelevant views [21]. None of these algorithms pays attention to the fast and efficient
traversal of source descriptions. As the number of sources grows, their metadata also
grows. How can the search space of metadata be reduced during relevance reasoning to
make the whole process more efficient? Answering this question ultimately leads to
scalable data integration systems in which sources can join and leave the system
arbitrarily while the query execution engine synchronizes itself with any change and
submits sub-queries only to the relevant and available data sources. Another deficiency
of these algorithms is that most of them use relational models for source descriptions,
whereas ontology-based models can represent fine-grained distinctions between the
contents and capabilities of the different data sources. These fine-grained distinctions can
help us reason about the data sources in a more precise and efficient manner.
In a nutshell, we have discussed the state-of-the-art algorithms used for query
reformulation in data integration systems. These algorithms have been analyzed and
compared with each other, and their features and deficiencies illustrated.
CHAPTER 4
PROPOSED ARCHITECTURE
In order to execute a user's query in the scalable data integration system proposed in
[8], the query execution process needs to be optimized. We have proposed an ontology-driven
relevance reasoning architecture to improve the response time of user queries
during relevance reasoning. This chapter is organized into three major sections. The first
section discusses the components of the proposed relevance reasoning architecture. The
second section explains the semantic matching process and the proposed scoring strategy.
Finally, the proposed methodology for relevance reasoning is discussed in detail and
elaborated through an example.
4.1 Proposed Architecture for Relevance Reasoning
This section presents the proposed architecture designed for relevance reasoning
during source selection in a data integration system. The proposed architecture, shown in
Figure 5, comprises the following components.
4.1.1. Global Ontology: The global ontology is the knowledge base of the proposed
architecture. It helps in generating user queries and enables semantic inference. The
major components of the global ontology are: (1) domain knowledge, which represents
the domain of discourse in the form of RDF triples; each RDF triple is uniquely identified
by a globally unique identifier (GUID), and the GUIDs are used in the semantic indexing
scheme for relevance reasoning; (2) the concept and relationship hierarchies, which
represent the semantic relationships among concepts and among relationships
respectively and help in resolving the semantic heterogeneities that exist in a domain;
(3) the rule-base: a rule is an object that can be applied to deduce inferences from RDF
triples; every rule is identified by its name and consists of two parts, (a) an antecedent,
known as the body of the rule, and (b) a consequent, known as the head of the rule; the
rule-base is an object that consists of rules; (4) the rules-index, which computes and
maintains deduced inferences by applying a specific set of rule-bases in order to optimize
reasoning.
4.1.2. Ontology Management Service: The ontology management service facilitates
the creation and maintenance of the global ontology. It provides a set of application
program interfaces (APIs) that perform the following functions: (1) publishing the
domain knowledge in the form of RDF triples, assigning GUIDs to the RDF metadata
triples, and mapping the GUIDs over the bitmap index; (2) defining semantic operators
and constructing the concept and relationship hierarchies; (3) creating and dropping a
rule-base and modifying the set of rules in a rule-base; (4) creating and maintaining the
rules-index and synchronizing it after rules are modified in the rule-base.
4.1.3. Source Descriptions Storage (SDS): A source description is the metadata of a
data source. This metadata can be further classified into source metadata and content
metadata. In order to make the source descriptions of data sources interoperable in a
heterogeneous environment, they are described in a conceptual model in the form of a
local ontology [8]. The metadata of a data source is expressed as RDF triples in the local
ontology. These RDF triples are assigned local unique identifiers (LUIDs) using a
sequence-generating object of each data source. In a nutshell, the source descriptions
storage is a set of local ontologies.
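The LUID assignment just described can be sketched as follows. This is a minimal illustration under stated assumptions: each source gets its own sequence object starting at 1, and a local ontology is modeled as a dictionary from LUID to triple. Class and method names are illustrative, not the thesis implementation.

```python
from itertools import count

# A minimal sketch of source descriptions storage: each registered source
# gets its own local ontology (here a dict) and its own LUID sequence.

class SourceDescriptionsStorage:
    def __init__(self):
        self.local_ontologies = {}   # source -> {LUID: triple}
        self.sequences = {}          # source -> per-source sequence generator

    def register_source(self, source):
        self.local_ontologies[source] = {}
        self.sequences[source] = count(1)   # LUIDs start at 1 for each source

    def advertise(self, source, triple):
        """Assign the next LUID of this source to an advertised triple."""
        luid = next(self.sequences[source])
        self.local_ontologies[source][luid] = triple
        return luid

sds = SourceDescriptionsStorage()
sds.register_source("S1")
luid = sds.advertise("S1", ("niit:Ahmed", "rdf:type", "niit:Teacher"))
# luid == 1; each source numbers its own triples independently
```

The per-source sequence is the point of the design: LUIDs are only unique within one local ontology, unlike the GUIDs of the global ontology.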
4.1.4. Source Registration Service: The source registration service facilitates the
creation and maintenance of a local ontology for a data source in the source descriptions
storage. It provides a set of application program interfaces (APIs) that perform the
following functions: (1) creates a unique sequence number generating object for the
incoming data source; (2) creates a local ontology to hold the RDF triples advertised by
the data source; (3) registers the local ontology in the source descriptions storage;
(4) inserts the RDF triples of the data source into its corresponding local ontology.
Figure 5: Proposed Architecture for Relevance Reasoning in Data Integration Systems
4.1.5. Bitmap Index Storage: A bitmap index is a cross-tab structure of bits [26, 28].
We employ a bitmap index for efficient traversal during relevance reasoning. The bitmap
index is divided into bitmap segments, and internally the data in a bitmap segment is
represented as bits. Each data source retains one bitmap segment in the bitmap index. In
the proposed architecture, the data sources are represented on the vertical side of the
index, whereas the RDF triples of the global ontology are represented on the horizontal
side. A bit is unset (0) if the data source does not contain the corresponding RDF triple,
and set (1) if it does. A sequence number generating object is used to assign a unique
identifier to each bitmap segment.
4.1.6. Index Management Service: The index management service facilitates the
creation and maintenance of a bitmap segment for a data source in the bitmap index
storage. It provides a set of application program interfaces (APIs) that perform the
following functions: (1) bitmap segment creation creates the bitmap segment for an
incoming data source and initializes all its bits to 0 (unset); (2) bitmap synchronization
keeps the bitmap segment of a data source consistent with its local ontology; (3) shuffle
bit shuffles the bits of a bitmap segment during synchronization.
4.1.7. Index Lookup Service: The index lookup service facilitates efficient traversal
of the bitmap index. It provides a set of application program interfaces (APIs) that
perform the following functions: (1) relevant source identification traverses the bitmap
index against an RDF triple and identifies the bitmap segments where the bit is set;
(2) irrelevant source pruning traverses the bitmap index against an RDF triple and
identifies the irrelevant bitmap segments where the bit is unset.
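The two lookup operations can be sketched directly over a toy index. This is an illustrative simplification: the index is a dictionary from source name to bit list, and `position` stands for the bit position that a triple's GUID maps to; function names are assumptions, not the thesis API.

```python
# A sketch of the two index-lookup operations over a toy bitmap index
# stored as {source: list-of-bits}. `position` is the bit position assigned
# to a triple's GUID; names are illustrative only.

def relevant_sources(index, position):
    """Sources whose bit is set at `position` (they contain the triple)."""
    return [src for src, bits in index.items() if bits[position] == 1]

def irrelevant_sources(index, position):
    """Sources whose bit is unset at `position` (safe to prune)."""
    return [src for src, bits in index.items() if bits[position] == 0]

index = {
    "S1": [1, 0, 1],
    "S2": [0, 0, 1],
    "S3": [1, 1, 0],
}
# relevant_sources(index, 0)   -> ["S1", "S3"]
# irrelevant_sources(index, 0) -> ["S2"]
```

Identification and pruning are complementary scans of the same column of bits, which is why both can be served by a single index traversal.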
4.1.8. Ontology Reasoning Service: The ontology reasoning service provides
reasoning and inference capabilities to the proposed architecture. It offers a set of
application program interfaces (APIs) that perform the following functions: (1) semantic
matching, the process of finding semantic similarity among different terms (concepts and
relationships) in order to resolve semantic heterogeneities; (2) inference and reasoning,
which supports the semantic matching process by incorporating the rules, rule-base, and
rules-index; (3) semantic query generation, which generates queries against the global
ontology using the semantic operators during semantic matching. Note that these queries
are different from the user query and should not be confused with it.
4.1.9. Relevance Reasoning Service: The relevance reasoning service identifies the
relevant and effective data sources for a query, using the index lookup service over the
bitmap index. It provides a set of application program interfaces (APIs) that perform the
following functions: (1) semantic query expansion expands a user query to its
semantically relevant RDF triples; (2) relevance reasoning identifies the relevant and
effective data sources for a given user query; (3) relevance ranking ranks the data sources
for a given user query based on the semantic similarity scores obtained.
4.2 Semantic Matching & Source Ranking for RDF Triples
4.2.1. Relevance Levels and Proposed Scoring Strategy: During semantic matching,
the terms of the user's query triples are matched with the terms of source triples. As a
result, one of five relevance levels is obtained for each term. These relevance levels are
given numeric scores for the purpose of quantification, which helps us rank a source for a
given query. The relevance levels and the operators used in the semantic matching
process are defined and explained below.
4.2.1.1. Exact Matching: A term is an exact match of another term if and only if the
two are lexically equal. For example, the term nust:Instructor is an exact match of
niit:Instructor. A numeric score of 1.0 is assigned to exact-matching terms wherever they
appear in an RDF triple.
4.2.1.2. Synonym Matching: It is unrealistic to assume that the same name will
always be used for a concept across a domain, so an explicit specification of synonyms
using some operator is required. Synonyms are terms that are lexically different but have
the same meaning. For example, the term nust:Instructor is a synonym of the term
niit:Teacher. A numeric score of 0.8 is assigned to synonym-matching terms wherever
they appear in an RDF triple. We use the owl:sameAs operator for specifying these
mappings in the rule-base of the global ontology.
4.2.1.3. Subclass Matching: In some scenarios, taxonomies are used for knowledge
representation, where generic concepts subsume specific concepts. To cope with the
subsumption relationship, an operator for its explicit specification is required. A term is a
subclass of another term if and only if it is subsumed by that term. For example,
nust:Employee might subsume niit:Instructor. A numeric score of 0.6 is assigned to
subclass-matching terms wherever they appear in an RDF triple. We use the
rdfs:subClassOf operator for specifying these mappings in the rule-base of the global
ontology.
4.2.1.4. Degree of Likelihood: In some situations, data sources might contain
concepts that are not totally disjoint but are instead related to some other term with a
degree of likelihood. For example, the term nust:Instructor might be relevant to
nust:TeacherAssistant with some degree of likelihood. This type of mapping cannot be
specified using the previously defined operators. A numeric score of 0.5 is assigned to
likelihood-related terms wherever they appear in an RDF triple. We use the
owl:equivalentOf operator for specifying these mappings in the rule-base of the global
ontology.
4.2.1.5. Disjoint: A term is disjoint from another term if and only if they are different
from each other. For example, the term nust:Instructor is disjoint from nust:Student. A
numeric score of 0.0 is assigned to disjoint terms wherever they appear in any component
of an RDF triple. These relevance levels and their scoring strategies are summarized in
Table 1 below:
Table 1: Relevance levels and scoring strategy

    Level   Relevance relation   Score
    1       exact match          1.0
    2       sameAs               0.8
    3       subClassOf           0.6
    4       equivalentOf         0.5
    5       disjointFrom         0.0
4.2.2. Term Similarity: We use the same semantic matching strategy for both
concepts and relationships; terms include both, and we maintain a concept hierarchy and
a relationship hierarchy. We extract the relationship between the query and source terms
using the respective hierarchy and then assign the standard relevance score defined in
Table 1. An RDF triple contains a subject, a predicate, and an object. The subject and
object are treated as concepts, so their similarity is computed using the concept hierarchy,
whereas the predicate's similarity is calculated using the relationship hierarchy.
4.2.3. RDF Triple Similarity: To calculate the relevance between user query triples
and source RDF triples, we combine both aspects of term similarity (i.e., concepts and
relationships). The overall RDF triple similarity is calculated as shown in Equation 1,
where qT denotes the query triple and s denotes a source, qt and st are the query and
source terms to be matched, and Sim(qT, s) is the overall similarity of a single query
triple for a given source. Here i and j range over the source RDF triples and the query
triple terms respectively.
4.2.4. Source Ranking: The user query triples and source RDF triples are matched to
find the similarity of each query triple with the data source's triples. Once the RDF triple
similarities have been computed, the source score for the whole query is computed using
the formula in Equation 2, and the data sources are ranked based on the scores obtained.
In Equation 2, sim_src is the total score of a source s for a user query, obtained by
multiplying the similarity scores of all the query triples; q_i denotes the i-th query triple
and n the total number of query triples:

    sim_src(s) = Sim(q_1, s) × Sim(q_2, s) × ... × Sim(q_n, s)    (2)
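The scoring scheme can be made concrete with a small worked example. Two assumptions are stated up front, since the full form of Equations 1 and 2 does not survive in this copy: a query triple's similarity to a source is taken as the product of its three term scores (subject, predicate, object), and the source score multiplies the similarities of all query triples, as the text describes for Equation 2. All names are illustrative.

```python
# A worked sketch of the Table 1 scoring scheme, under two stated
# assumptions: a triple match is scored as the product of its three term
# scores, and the source score is the product over all query triples.

SCORES = {"exact": 1.0, "sameAs": 0.8, "subClassOf": 0.6,
          "equivalentOf": 0.5, "disjointFrom": 0.0}

def triple_similarity(term_relations):
    """term_relations: relevance level of (subject, predicate, object)."""
    s, p, o = term_relations
    return SCORES[s] * SCORES[p] * SCORES[o]

def source_score(per_triple_relations):
    """One entry per query triple; multiply their similarities (Eq. 2)."""
    score = 1.0
    for relations in per_triple_relations:
        score *= triple_similarity(relations)
    return score

# One query triple matching a source exactly on subject and object but via
# a synonym (sameAs) on the predicate: 1.0 * 0.8 * 1.0 = 0.8
score = source_score([("exact", "sameAs", "exact")])
```

Note that under this multiplicative scheme a single disjointFrom match (score 0.0) zeroes out the whole source score, which matches the intuition that a disjoint term makes the source irrelevant for that query.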
4.3 Proposed Semantic Matching Methodology
This section discusses our proposed methodology for relevance reasoning, which
identifies the most relevant and effective data sources using a bitmap index. The
methodology can be divided into three main workflows, which help in understanding the
intricacies of the proposed architecture. Each workflow is discussed in detail below.
4.3.1. Ontology Management Workflow: The ontology management workflow
manages the global ontology in the architecture; the ontology management service plays
a prominent part in it. The four major activities carried out by the ontology management
workflow are:
Domain knowledge representation
Concept & relationship hierarchy representation
Rules & rules-base management
Rules-index management
Figure 6 shows, as a sequence diagram, all the activities performed during the
ontology management workflow.
Figure 6: Sequence Diagram for Ontology Management Workflow
Domain knowledge representation is the registration of RDF triples in the global
ontology. These RDF triples are stored in the global ontology, and GUIDs are assigned to
them using a unique sequence number generator object. The GUIDs are then allocated
positions over the bitmap index, and the transactions are permanently recorded in the
global ontology. The snippet in Figure 7 shows pseudo-code for the insertion of an RDF
triple into the global ontology; its implementation issues and details are discussed in a
later chapter.
Pseudo-Code for Domain Knowledge Registration

    For each RDF triple of global ontology
        Assign GUID to RDF triple
        Add RDF triple to the global ontology
        Extend bitmap index
        Increase the length of bitmap pattern by one
        Assign location to the RDF triple reserved over the bitmap index
    Perform commit to apply changes persistently to global ontology
Concept & relationship hierarchy representation involves the definition of the
semantic operators and the use of these operators to build the respective hierarchies.
These operators include sameAs, equivalentOf, subClassOf, and disjointFrom, as
explained in the previous section. RDF triples are added to the global ontology to
represent the concept and relationship hierarchies; the bitmap index is not maintained for
these RDF triples.
Rules & rules-base management involves the creation of the rules-base and the
insertion of rules into it. In order to reduce the number of mappings among the
hierarchies and increase the inference capabilities of the rule-base, two rules are inserted
for each semantic operator: InverseOf<operator> and TransitiveOf<operator>. The
InverseOf<operator> rule tells the rule-base that if a term A is related to another term B
by a relation R, then B is related to A by R⁻¹. Figure 8 shows the N3 representation of the
InverseOf rule for the sameAs operator in the Semantic Web Rule Language.
The TransitiveOf<operator> rule tells the rule-base that if a term A is related to a
term B by some relation R, and B is in turn related to a term C by the same relation R,
then A is related to C by R. Figure 9 shows the N3 representation of the TransitiveOf rule
for the sameAs operator in the Semantic Web Rule Language.
Figure 7: Pseudo-code for RDF triple registration in the global ontology
: Def-InverseOfSameAs@swrl(“(?x sameAs ?y) -> (?y sameAs ?x)”)
Figure 8 InverseOf SameAs rule inserted in the rule-base
: Def-TransitiveOfSameAs@swrl (“(?x sameAs ?y) (?y sameAs ?z) -> (?x sameAs ?z)”)
Figure 9 TransitiveOf SameAs rule inserted in the rule-base
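What the two rules in Figures 8 and 9 jointly compute is the symmetric, transitive closure of the asserted sameAs pairs. The sketch below is plain Python standing in for the SWRL rules, with a naive fixed-point loop; the term names are illustrative examples, not assertions from the thesis.

```python
# A sketch of what the InverseOf and TransitiveOf rules for sameAs compute:
# the symmetric, transitive closure of the asserted sameAs pairs, found by
# iterating both rules to a fixed point.

def sameas_closure(pairs):
    closed = set(pairs)
    changed = True
    while changed:
        changed = False
        for (a, b) in list(closed):
            if (b, a) not in closed:                  # InverseOfSameAs
                closed.add((b, a))
                changed = True
        for (a, b) in list(closed):
            for (c, d) in list(closed):
                if b == c and (a, d) not in closed:   # TransitiveOfSameAs
                    closed.add((a, d))
                    changed = True
    return closed

facts = {("nust:Instructor", "niit:Teacher"), ("niit:Teacher", "uet:Lecturer")}
inferred = sameas_closure(facts)
# Both ("niit:Teacher", "nust:Instructor") and
# ("nust:Instructor", "uet:Lecturer") are now derivable.
```

This is why only two rules per operator suffice to keep the explicit mapping set small: the rules-index can pre-compute the closure rather than requiring every pairwise mapping to be asserted by hand.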
Rules-index management involves the creation and management of the rules-index
for a rules-base. Once rules are inserted into the rules-base, the corresponding rules-index
is refreshed to pre-compute the inferred RDF triples.
4.3.2. Source Registration Workflow: The source registration workflow registers
data sources in the data integration system. The three major activities carried out by the
source registration workflow are:
Local ontology creation
Bitmap segment creation
Bitmap synchronization
Local ontology creation involves the creation of a local ontology for the incoming
data source and of a unique sequence number generator object, along with the insertion of
RDF triples into the created ontology. The source registration service plays a prominent
part here: an ontology is created for the incoming data source and registered with the
source descriptions storage; the RDF triples advertised by the data source are assigned
local unique identifiers (LUIDs) and added to the local ontology; and the transactions are
permanently recorded in the source descriptions storage. The snippet in Figure 10 shows
pseudo-code for local ontology creation and RDF triple insertion; its implementation
issues and details are discussed in a later chapter.
Pseudo-Code for Local Ontology Creation

    Create ontology for incoming source in Source Descriptions Storage
    Create unique sequence generator for incoming source RDF triples
    Assign LUIDs to the RDF triples
    Add RDF triples to the local ontology in Source Descriptions Storage
    Perform commit to apply changes persistently to Source Descriptions Storage
Figure 10 Pseudo-code for RDF triple creation of local ontology
Bitmap segment creation involves cloning the bitmap pattern and creating a bitmap
segment for the incoming data source over the bitmap index. The index management
service plays a prominent role here. The bitmap pattern stored in the global ontology is
cloned for the newly created bitmap segment, and initially all its bits are unset, i.e., 0. A
unique identifier is assigned to the bitmap segment, which is then added to the bitmap
index. The snippet in Figure 11 shows pseudo-code for bitmap segment creation; its
implementation issues and details are discussed in a later chapter.
Bitmap synchronization involves plotting the RDF triples of a data source
consistently and correctly by shuffling the bits in its bitmap segment. The index
management service plays a prominent role here by spawning a listener process that
listens for any invalidation in the source descriptions storage (i.e., changes in a local
ontology that have not yet been propagated and plotted over the bitmap index). If any
invalidation is found, index synchronization starts. During synchronization, the RDF
triples of the data source are fetched; every RDF triple is decomposed into its terms
(subject, predicate, and object) and given to the ontology reasoning service. The ontology
reasoning service performs reasoning and inference, which helps the index management
service extract the GUIDs for the corresponding RDF triples. The positions of the GUIDs
are identified over the bitmap index
Pseudo-Code for Bitmap Segment Creation

    Check whether bitmap segment exists for the incoming source
    If (no)
        Clone bitmap pattern from global ontology RDF triples
        Initialize bits to zero (0)
        Assign a unique number to the bitmap segment
        Add bitmap segment to the bitmap index for incoming source
        Perform commit to apply changes persistently in index
Figure 11 Pseudo-Code for Bitmap Segment Creation
Pseudo-Code for Bitmap Synchronization

    For each incoming RDF triple advertised by a data source
        Decompose RDF triple into its components
        Perform reasoning for semantic similarity
        Extract GUID for the corresponding RDF triple
        Identify its position over the bitmap index
        Fetch the bitmap segment for the data source
        Shuffle the bit to 1 at the corresponding position in the bitmap segment
    Perform commit to apply changes persistently in index
and the bits are shuffled accordingly. The snippet in Figure 12 shows pseudo-code for
bitmap synchronization; its implementation issues and details are discussed in a later
chapter.
Figure 13 shows, as a sequence diagram, all the activities performed during the
source registration workflow.
Figure 12: Pseudo-Code for Bitmap Synchronization
Figure 13: Sequence Diagram for Source Registration Workflow
4.3.3. Relevance Reasoning Workflow: The relevance reasoning workflow comprises
the steps carried out to identify the relevant and effective data sources for a user's query.
The relevance reasoning service plays a prominent part in this workflow, cooperating
with the index lookup service and the ontology reasoning service to perform the
following activities:
Semantic query expansion
Source selection
Source Ranking
The Figure 14 shows all the activities that are performed during the source
registration workflow using sequence diagram.
Semantic query expansion: A user submits the query in RDF which is passed to the
relevance reasoning service. The RDF triples that are entered by the user into a query are
called asserted query triples. A user can submit queries in global ontology terms as well
as local ontology terms of their underlying data sources. Relevance reasoning service
expands the user query to all possible combinations using ontology reasoning service.
Every term of the query triple is expanded using semantic operators for synonyms, lexical
variants, subsumption, and degree of likelihood. This expansion adds extra
triples to the user query; these RDF triples are called inferred query triples.
The snippet in Figure 15 shows pseudo-code for the semantic query expansion. Its
implementation details and issues are discussed in the following chapter.

Figure 14: Sequence Diagram for Relevance Reasoning Workflow

Pseudo-code for Query Expansion in Relevance Reasoning

InferredTriplesList = Ø
For each RDF triple in AssertedTripleList of user's query
    Isolate subject, property, and object of the current RDF triple
    Calculate semantic similarity and add relevant terms for the subject of the RDF triple
    Calculate semantic similarity and add relevant terms for the property of the RDF triple
    Calculate semantic similarity and add relevant terms for the object of the RDF triple
    Take the Cartesian product of the terms
    Populate InferredTriplesList with the Cartesian product
Return InferredTriplesList
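The Cartesian-product expansion described above can be sketched in Python. This is an illustrative model only; the thesis implementation is in PL/SQL, and the stub dictionary of relevant terms stands in for the ontology reasoning service.

```python
from itertools import product

# Sketch of semantic query expansion. The relevant-term lookups are
# stubbed with a dictionary; in the thesis they come from the ontology
# reasoning service.

def expand_triple(triple, relevant_terms):
    s, p, o = triple
    subjects = [s] + relevant_terms.get(s, [])
    properties = [p] + relevant_terms.get(p, [])
    objects = [o] + relevant_terms.get(o, [])
    # The Cartesian product of the three buckets yields the inferred triples.
    return list(product(subjects, properties, objects))

relevant = {
    "Instructor": ["Professor", "Lecturer"],
    "isTeaching": ["Teaches"],
    "Course": ["Subject"],
}
inferred = expand_triple(("Instructor", "isTeaching", "Course"), relevant)
# 3 subjects x 2 properties x 2 objects = 12 inferred combinations,
# the first of which is the asserted triple itself.
```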
Source Selection: Once the query has been expanded with semantically relevant RDF triples,
their GUIDs are reconciled from the global ontology. The GUIDs give the positions
of the RDF triples over the bitmap index. These positions are passed to the index lookup
service, which traverses the bitmap segments of each source at the corresponding
positions and identifies the data sources for which the bits are set. The snippet in Figure
16 shows pseudo-code for the source selection; its implementation details and issues
are discussed in the following chapter.
Figure 15 Pseudo-Code for Query Expansion in Relevance Reasoning Workflow
Pseudo-code for Source Selection in Relevance Reasoning

RelevantSourceList = Ø
For each RDF triple in user's query [Asserted + Inferred]
    Reconcile the GUID for the incoming RDF triple from the global ontology
    Identify the bitmap location of the RDF triple using the GUID
    Pass the bitmap location to the index lookup service
    Traverse the bitmap segments at the corresponding location to identify relevant sources
    Add the sources to RelevantSourceList
Return RelevantSourceList
Source Ranking: The identified data sources are ranked according to their relevance
to the user query. Table 1 shows our scoring scheme. First, term similarity is computed
for each component of a query RDF triple against a given source. The term similarities
are then used in equation 1 to compute the RDF triple similarity. Finally, source
similarity is computed by equation 2, and the sources are ranked according to the score
obtained for the given user query.
4.4 Explanation of the Proposed Methodology using a Case Study
We use a portion of the well-known university ontology as an example. In this
scenario, we have a global ontology named NUST_DB, as shown in Figure 17, and
three data sources named EME_DB, NIMS_DB, and NIIT_DB. The RDF triples of the
global ontology are shown in Table 2.
NUST_RDF_DATA

GUID          RDF Triples
nust-1000001  < nust:Instructor, nust:isTeaching, nust:Course >
nust-1000002  < nust:Instructor, nust:isAdvisorOf, nust:Student >
nust-1000003  < nust:Student, nust:isRegisteredIn, nust:Course >
nust-1000004  < nust:Student, nust:hasMajor, nust:Department >
nust-1000005  < nust:Instructor, nust:worksIn, nust:Department >
nust-1000006  < nust:TeachingAssistant, nust:isAssisting, nust:Course >
Figure 16 Pseudo-Code for Source Selection in Relevance Reasoning Workflow
Table 2 RDF triples of the Global Ontology
Figure 17 Snap shot of the Global Ontology
[Figure 17 depicts the global ontology graph: Instructor isTeaching Course and worksIn Department; Instructor isAdvisorOf Student; Student isRegisteredIn Course and hasMajor Department; TeachingAssistant isAssisting Course.]
The RDF triples of the global ontology form the basis for the bitmap indexing in our
proposed architecture. The pattern of the index is illustrated in Table 3.
Source-segment position-1 position-2 position-3 position-4 position-5 position-6
xxxxxxxxxxxxxx nust-1000001 nust-1000002 nust-1000003 nust-1000004 nust-1000005 nust-1000006
To manage concept and relationship hierarchies, the semantic matching
operators sameAs, equivalentOf, subClassOf, and disjointFrom are defined. A concept
such as nust:Instructor is mapped to the concept niit:Lecturer using the subClassOf
operator to specify a subsumption relationship. The term nust:Course is mapped to the
term nust:Subject using the sameAs operator to specify synonyms and lexical
variants. Similarly, nust:Instructor is mapped to nust:TeachingAssistant using the
equivalentOf operator to specify a degree of likelihood, and so on. Relationship
hierarchies are managed accordingly. These hierarchies are illustrated
in Figure 18.
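The mappings above can be modelled as a set of (term, operator, term) entries queried per operator, which is how the buckets of Table 6 are later populated. An illustrative Python sketch follows; the helper names are ours, and the pairs are taken from the case study.

```python
# Concept/relationship mappings under the semantic operators, with
# pairs taken from the case study (Figure 18). Names are illustrative.

MAPPINGS = [
    ("Course", "sameAs", "Subject"),
    ("isTeaching", "sameAs", "Teaching"),
    ("isTeaching", "sameAs", "Teaches"),
    ("Professor", "subClassOf", "Instructor"),
    ("Prof", "subClassOf", "Instructor"),
    ("Lecturer", "subClassOf", "Instructor"),
    ("Teacher", "subClassOf", "Instructor"),
    ("TeachingAssistant", "equivalentOf", "Instructor"),
]

def related_terms(term, operator):
    """Terms linked to `term` by `operator`, read in either direction."""
    out = set()
    for a, op, b in MAPPINGS:
        if op != operator:
            continue
        if a == term:
            out.add(b)
        elif b == term:
            out.add(a)
    return out

# Subject bucket for "Instructor" under subClassOf, as in Table 6:
subclasses = related_terms("Instructor", "subClassOf")
```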
Three local ontologies are created for the data sources following the naming
convention <DataSource>_RDF_Data. There are semantic heterogeneities between
Table 3 Structure of Bitmap Index
Bitmap Pattern
Figure 18 Concept & Relationship Hierarchies Managed using Semantic Operators over Global Ontology
[Figure 18 depicts the hierarchies: Professor, Prof, Lecturer, and Teacher are linked to Instructor via subClassOf; Course is linked to Subject, and isTeaching to Teaching and Teaches, via sameAs; TeachingAssistant and isAssisting are linked via equivalentOf; exactMatch relates identical terms.]
the contents of the data sources. Table 4 describes the RDF triples of the sources stored in
their respective ontologies.
EME_RDF_DATA
Local Link-ID  RDF Triples
eme-1011       < eme:Professor, eme:Teaches, eme:Subject >
eme-1012       < eme:Professor, eme:Advises, eme:Student >
eme-1013       < eme:Student, eme:RegisteredIn, eme:Subject >

NIMS_RDF_DATA
Local Link-ID  RDF Triples
nims-2011      < nims:Teacher, nims:isAdvisorOf, nims:Student >
nims-2012      < nims:Teacher, nims:WorksIn, nims:Department >
nims-2013      < nims:Student, nims:hasMajor, nims:Department >

NIIT_RDF_DATA
Local Link-ID  RDF Triples
niit-3011      < niit:Lecturer, niit:isTeaching, niit:Course >
niit-3012      < niit:TeachingAssistant, niit:isAssisting, niit:Course >
The prefixes nust, niit, eme, and nims refer to URLs http://www.nust.edu.pk,
http://www.niit.edu.pk, http://www.nims.edu.pk, and http://www.eme.edu.pk respectively.
Once the local ontologies have been created, the index management service comes into play: it
creates the bitmap segments in the bitmap index for the data sources and plots (synchronizes)
the RDF triples of the data sources in their respective bitmap segments. During synchronization,
the index management service also resolves the semantic heterogeneities. The structure of the
bitmap index is illustrated in Table 5.
Source-segment  nust-1000001  nust-1000002  nust-1000003  nust-1000004  nust-1000005  nust-1000006
EME-DB          1             1             1             0             0             0
NIMS-DB         0             1             0             1             1             0
NIIT-DB         1             0             1             0             0             1
Suppose a user query contains the RDF triple <Instructor, isTeaching, Course>.
The relevance reasoning service decomposes this triple into its terms and creates three
buckets: one for the subject, one for the property, and one for the object. Each term is
given to the ontology reasoning service, which calculates its semantic similarity in the
respective hierarchy to find relevant terms. The buckets are populated as shown in Table 6.
Table 4 RDF triples of the data sources
Table 5 Structure of Bitmap Index after sources are registered
Table 6 Buckets created for the RDF triples
Semantic Operator Used  Subject Bucket for "Instructor"     Property Bucket for "isTeaching"  Object Bucket for "Course"
exactMatch              Instructor                          isTeaching                        Course
sameAs                  NULL                                Teaching, Teaches                 Subject
subClassOf              Professor, Prof, Lecturer, Teacher  NULL                              NULL
equivalentOf            TeachingAssistant                   isAssisting                       NULL
The Cartesian product of the subject, property, and object buckets is taken to construct
the inferred triple list. Table 7 shows this Cartesian product.
Expansion of RDF triple using Ontology Reasoning Service
<Instructor>, <isTeaching>, <Course>
<Instructor>, <Teaching>, <Course>
<Instructor>, <Teaches>, <Course>
<Instructor>, <isAssisting>, <Course>
...
<Instructor>, <isAssisting>, <Subject>
<Professor>, <isTeaching>, <Course>
<Professor>, <Teaching>, <Course>
<Professor>, <Teaches>, <Course>
...
<Professor>, <isAssisting>, <Subject>
<Prof>, <isTeaching>, <Course>
<Prof>, <Teaching>, <Course>
<Prof>, <isAssisting>, <Subject>
<Lecturer>, <isTeaching>, <Course>
<Lecturer>, <Teaching>, <Course>
<Lecturer>, <Teaches>, <Course>
...
<Teacher>, <Teaching>, <Course>
<Teacher>, <Teaches>, <Course>
...
<Teacher>, <isAssisting>, <Subject>
<TeachingAssistant>, <isTeaching>, <Course>
<TeachingAssistant>, <Teaching>, <Course>
<TeachingAssistant>, <Teaches>, <Course>
<TeachingAssistant>, <isAssisting>, <Course>
...
<TeachingAssistant>, <isAssisting>, <Subject>
Table 7 Inferred RDF triples for a user's query triple

In order to execute a query over the bitmap index, GUIDs are needed. An RDF triple
is rejected if no GUID is available for it in the global ontology. In this example, the
GUIDs nust-1000001 and nust-1000006 are fetched from the global ontology. These GUIDs are
passed to the index lookup service to identify relevant and effective data sources. The
index lookup service traverses the bitmap index for only these GUIDs and returns all
bitmap segments where the bits are set, i.e., EME-DB and NIIT-DB.
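Using the bitmap segments of Table 5, this lookup can be reproduced in a few lines. The sketch below is an illustrative Python model, not the thesis's PL/SQL implementation.

```python
# Source selection over the example bitmap index of Table 5.
# Positions 0..5 correspond to GUIDs nust-1000001..nust-1000006.

SEGMENTS = {
    "EME-DB":  [1, 1, 1, 0, 0, 0],
    "NIMS-DB": [0, 1, 0, 1, 1, 0],
    "NIIT-DB": [1, 0, 1, 0, 0, 1],
}
POSITIONS = {"nust-1000001": 0, "nust-1000006": 5}

def select_sources(guids):
    """Collect every source whose bit is set at any queried GUID position."""
    relevant = set()
    for guid in guids:
        pos = POSITIONS[guid]
        for source, bits in SEGMENTS.items():
            if bits[pos] == 1:
                relevant.add(source)
    return relevant

sources = select_sources(["nust-1000001", "nust-1000006"])
```

With the query GUIDs of the case study, the lookup yields EME-DB and NIIT-DB, matching the text.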
To sort the data sources by their relevance to the query triples, semantic
similarity scoring is applied as shown in Table 1. First, term similarity is
computed for the query triples against the data source triples using the concept and
relationship hierarchies.
EME-DB scores 0.6 for matching the subject of the query triple, Instructor, with the
subject of the source triple, Professor; the concept hierarchy returns a subClassOf
relationship between these terms. Next, the properties of the query and source triples are
matched, scoring 0.8 for isTeaching and Teaches, because they are connected by a sameAs
relationship. Finally, the objects of the query and source triples are matched, scoring
0.8 for Course and Subject.
NIIT-DB scores 0.6 for matching the subject of the query triple, Instructor, with the
subject of the source triple, Lecturer; the concept hierarchy returns a subClassOf
relationship for this match. The data source scores 1 for matching the property
isTeaching with the query property isTeaching, and 1 again for matching the objects
Course and Course. NIIT-DB also contains a triple that is relevant to the query
triple with some degree of likelihood, i.e., nust-1000006.
The relevance of a data source for every query triple is calculated by putting the term
similarity scores into the equation 1 and is shown in Table 8.
Table 8: Semantic Similarity Calculation of a Data Source for a User Query Triple

Relevant      GUIDs         Term Similarity                             Source Similarity for
Data Source                 sim(subject)  sim(property)  sim(object)    Query Triple (qT)
EME-DB        nust-1000001  0.6           0.8            0.8            0.384
NIIT-DB       nust-1000001  0.6           1              1              0.6
              nust-1000006  0.5           0.5            1              0.25
Finally, the overall similarity score of a data source for a user's query is calculated
using equation 2 and is shown in Table 9. The sources are sorted and given to the query
rewriting component.
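Equations 1 and 2 are not reproduced here, but the worked values in Tables 8 and 9 are consistent with multiplying the three term similarities per triple and summing the triple scores per source. The Python sketch below works under that assumption; the function names are ours, not the thesis's.

```python
from math import prod

# Reproduces the worked scores of Tables 8 and 9, assuming triple
# similarity = product of term similarities (equation 1) and source
# similarity = sum of triple similarities (equation 2).

def triple_similarity(sim_subject, sim_property, sim_object):
    return prod([sim_subject, sim_property, sim_object])

def source_similarity(triple_scores):
    return sum(triple_scores)

eme = triple_similarity(0.6, 0.8, 0.8)              # nust-1000001
niit = [triple_similarity(0.6, 1, 1),               # nust-1000001
        triple_similarity(0.5, 0.5, 1)]             # nust-1000006

ranking = sorted(
    {"EME-DB": source_similarity([eme]),
     "NIIT-DB": source_similarity(niit)}.items(),
    key=lambda kv: kv[1], reverse=True)
```

Under this reading, NIIT-DB (0.85) ranks above EME-DB (0.384), matching Table 9.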
Relevant      Source Similarity for     Total Source Similarity
Data Source   Query Triple (qT)         for User Query
EME-DB        nust-1000001: 0.384       simEME = 0.384
NIIT-DB       nust-1000001: 0.6         simNIIT = 0.85
              nust-1000006: 0.25
In a nutshell, we have explained our proposed architecture of relevance reasoning for
source selection in data integration. Different workflows are highlighted and semantic
matching methodology has been explained using a case study.
Table 9: Semantic Similarity Calculation of a Data Source for User Query
CHAPTER 5

IMPLEMENTATION
This chapter discusses our implementation strategy and the issues encountered for the
proposed architecture. The first section discusses in detail the Oracle
implementation of the ontologies and RDF data; the second section discusses the
implementation details of our proposed architecture for relevance reasoning.
5.1 RDF Data and Ontologies in the Oracle Database
In Oracle Database 10g Release 2, a new data model was introduced for storing
RDF and OWL data. This functionality builds on the Oracle Spatial Network Data
Model (NDM), which is the Oracle solution for managing graphs within the Oracle
Database. The RDF Data Model supports three types of database objects: model or
ontology (an RDF graph consisting of a set of triples), rule-base (a set of rules), and
rule index (an entailed RDF graph).
5.1.1. RDF Data Model or Ontology: There is a single universe for all RDF data stored
in the database. All RDF triples are parsed and stored in the system under the MDSYS
schema as shown in Figure 19. An RDF triple (subject, predicate, and object) is treated as
one database object. A single RDF document that contains multiple triples, therefore,
results in many database objects.
RDF_MODEL$ is a system level table created to store information on all of the RDF
and OWL ontologies in a database. Whenever a new ontology is created, new
MODEL_ID is automatically generated for it. An entry is made into the RDF_MODEL$
table.
2 http://www.oracle.com/index.html
The RDF_NODE$ table stores the VALUE_ID for text values that participate in
subjects or objects of statements. The NODE_ID is the same as the VALUE_ID.
NODE_ID values are stored once, regardless of the number of subjects or objects they
participate in. The node table allows RDF data to be exposed to all of the analytical
functions and APIs available in the core NDM.
The RDF_LINK$ table stores the triples for all of the RDF models in the database;
the MODEL_ID logically partitions the RDF_LINK$ table. Selecting all of
the links for a specified MODEL_ID returns the RDF network for that particular
ontology.
The RDF_VALUE$ table stores the text values, i.e. the Uniform Resource Identifiers
or literals for each part of the triple. Each text value is stored only once, and a unique
VALUE_ID is generated for the text entry. URIs, blank nodes, plain literals and typed
literals are all possible VALUE_TYPE entries.
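The VALUE_ID scheme described above amounts to interning: each distinct text value is stored once and referenced by ID thereafter. A minimal Python sketch of the idea follows; the class and names are hypothetical, not Oracle's.

```python
# Hypothetical sketch of value interning, modelling the idea behind the
# RDF_VALUE$ table: each distinct text value is stored once with a
# unique VALUE_ID.

class ValueTable:
    def __init__(self):
        self._ids = {}      # text value -> VALUE_ID
        self._values = []   # VALUE_ID -> text value

    def intern(self, text):
        """Return the existing VALUE_ID for `text`, or assign a new one."""
        if text not in self._ids:
            self._ids[text] = len(self._values)
            self._values.append(text)
        return self._ids[text]

values = ValueTable()
a = values.intern("http://www.nust.edu.pk/Instructor")
b = values.intern("http://www.nust.edu.pk/Instructor")  # stored only once
```

Repeated inserts of the same value return the same ID, regardless of how many triples reference it.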
Figure 19 Database Schema to store ontology in Oracle NDM
Blank nodes are used to represent unknown objects, and when the relationship
between a subject node and an object node is n-ary. New blank nodes are automatically
generated whenever blank nodes are encountered in triples. However, it is possible for
users to re-use blank nodes, for example when inserting data into containers or
collections. The RDF_BLANK_NODE$ table stores the original names of blank nodes
that are to be reused when encountered in triples.
To represent a reified statement a resource is created using the LINK_ID of the triple.
The resource can then be used as the subject or object of a statement. To process a
reification statement, a triple is first entered with the reified statement’s resource as
subject, rdf:type as property and rdf:Statement as object. A triple is then entered for each
assertion about the reified statement. However, each reified statement will have only one
rdf:type to rdf:Statement associated with it, regardless of the number of assertions made
using this resource.
The Oracle RDF Data Model supports containers and collections. A container or
collection will have an rdf:type to rdf:container_name or rdf:collection_name associated
with it, and a LINK_TYPE of RDF_MEMBER.
Two new object types have been defined for RDF-modeled data. SDO_RDF_TRIPLE
serves as the triple representation of RDF data, whilst SDO_RDF_TRIPLE_S is defined
to store persistent data in the database. The GET_RDF_TRIPLE() function can be used to
return an SDO_RDF_TRIPLE type.
5.1.2. Rule-base: Oracle supplies both an RDF rule-base that implements the RDF
entailment rules, and an RDF Schema (RDFS) rule-base that implements the RDFS
entailment rules. Both rule-bases are automatically created when RDF support is added to
the database. It is also possible to create a user-defined rule-base for additional
specialized inference capabilities. For each rule-base, a system table is created to hold
rules in the rule-base, along with a system view of the rule-base. The view is used to
insert, delete and modify rules in the rule-base. Information about all rule-bases is
maintained in the rule-base information view.
For example, the rule that the head of department (HoD) is also a faculty member of
the department could be represented as follows:
('HeadofDepartRule',                -- rule name
 '(?p :HoDOf ?d)',                  -- IF side pattern
 NULL,                              -- filter condition
 '(?p :FacultyMemberOf ?d)',        -- THEN side pattern
 SDO_RDF_Aliases(MDSYS.RDF_Alias('', 'http://www.seecs.edu.pk/univontology/')))
In this case, the rule has no filter condition, so that component of the
representation is NULL. Note that a THEN side pattern with more than one triple can be
used to infer multiple triples for each IF side match.
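The effect of HeadofDepartRule can be illustrated outside the database: every triple matching the IF-side pattern entails the corresponding THEN-side triple. The Python sketch below is our illustration (including the sample individual "Ali"), not part of the thesis.

```python
# Illustrative rule application: each (?p :HoDOf ?d) triple entails
# (?p :FacultyMemberOf ?d), as in HeadofDepartRule above.

def apply_rule(triples, if_property, then_property):
    """Infer a THEN-side triple for every triple matching the IF side."""
    inferred = set()
    for s, p, o in triples:
        if p == if_property:
            inferred.add((s, then_property, o))
    return inferred

# "Ali" and "SEECS" are hypothetical sample data.
asserted = {("Ali", ":HoDOf", "SEECS"), ("Ali", ":isTeaching", "Course")}
entailed = apply_rule(asserted, ":HoDOf", ":FacultyMemberOf")
```

Only the triple matching the IF side produces an entailment; unrelated triples are left alone.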
5.1.3. Rules Index: A rules index is an object containing pre-computed triples that can
be inferred from applying a specified set of rule-bases to a specified set of ontologies. If a
graph query refers to any rule-bases, a rule index must exist for each rule-base and
ontology combination in the query.
When a rule index is created, a view is also created of the RDF triples associated with
the index under the MDSYS schema. This view is visible only to the owner of the rules
index and to users with suitable privileges. Information about all rule indexes is
maintained in the rule index information view. Information about all database objects,
such as ontologies and rule-bases, related to rules indexes is maintained in the Rule Index
Datasets view.
5.1.4. Querying RDF Data: The SDO_RDF_MATCH function has been designed to
meet most of the requirements identified by W3C in SPARQL for graph querying. A Java
API is also provided for network representation and network analysis. Analysis
capabilities include the ability to find a path between two resources, or to find a path
between two resources when the links are of a specified type.
Use of the SDO_RDF_MATCH table function allows a graph query to be embedded
in a SQL query. It has the ability to search for an arbitrary pattern against the RDF data,
including inference, based on RDF, RDFS, and user-defined rules. It can automatically
resolve multiple representations of the same point in value space (e.g., "10"^^xsd:Integer
and "10"^^xsd:PositiveInteger).
5.2 Setting up the Stage for Implementation
The implementation of different components of the architecture is discussed in the
following subsections.
5.2.1. Enabling and Disabling RDF Support in the Database: Before using RDF
support in an Oracle database, we need to enable this feature. A procedure named
CREATE_RDF_NETWORK() of the SDO_RDF package is used to enable RDF support
in the database. This procedure creates the system tables and other database objects used
for RDF support. One must connect to the database as a user with DBA privileges in
order to call this procedure, and should call it only once for the database. To remove
RDF support from the database, call the SDO_RDF.DROP_RDF_NETWORK
procedure. The following example enables RDF support in the database.
Enabling the Semantic Network

BEGIN
  SDO_RDF.CREATE_RDF_NETWORK('rdf_tblspace');
END;
5.2.2. Creating the Global Ontology: The table used to store the RDF triples of the
global ontology, named GLOBAL_RDF_DATA, is shown below.

Column Name  Data type         Description
GUID         NUMBER            GUID assigned to an incoming RDF triple of the global ontology.
TRIPLE       SDO_RDF_TRIPLE_S  Stores the subject, predicate, and object of the RDF triple.
TRIPLE_TYP   VARCHAR2          Distinguishes whether the RDF triple is a rule-base (R) or metadata (M) triple.
BIT_POS      NUMBER            If the RDF triple type is M, stores the position of the GUID over the bitmap index.
A unique sequence generating object is used to assign GUIDs to the incoming RDF
triples. The example below shows the creation of the sequence generator object.
Creating the Sequence Generator for GUIDs

CREATE SEQUENCE s_global_rdf_data_id
  START WITH 1000
  INCREMENT BY 1
  NOCACHE
  ORDER;
Once the global ontology table has been created, we then create the global
ontology using the CREATE_RDF_MODEL() procedure of the SDO_RDF package.
The example below creates the global ontology.
Creating the Global Ontology
BEGIN
  SDO_RDF.CREATE_RDF_MODEL('global_ontology', 'global_rdf_data', 'triple');
END;
This procedure adds the global ontology to the MDSYS.RDF_MODEL$ table. To
delete ontology, use the SDO_RDF.DROP_RDF_MODEL procedure.
5.2.3. Creating the Bitmap Index: The table used to store the bitmap segments, named
BITMAP_INDX, is shown below.
Column Name     Data type  Description
SEGMENT_ID      NUMBER     A unique identifier assigned to the bitmap segment created for an incoming data source.
SEGMENT_SOURCE  URI        Stores the URI of the data source.
BITMAP_PATTERN  VARCHAR2   Stores the bits representing the RDF triples of a data source.
A unique sequence generating object is created to assign segment identifiers to newly
created bitmap segments. The example below shows the creation of the sequence generator
object.
Creating Sequence Generator for Bitmap Segments
CREATE SEQUENCE s_bitmap_segment_id
  START WITH 1000
  INCREMENT BY 1
  NOCACHE
  ORDER;
5.2.4. Defining Semantic Operators and Creating Hierarchies: The semantic
operators exactMatch, sameAs, equivalentOf, and subClassOf are defined
over the global ontology. The following example shows the SQL that defines the sameAs
operator; the same syntax is used to define the other operators.
Defining the sameAs Operator

INSERT INTO global_ontology_rdf_data VALUES (
  s_global_rdf_data_id.NEXTVAL,
  SDO_RDF_TRIPLE_S('global_ontology',
    'http://www.niit.edu.pk/Research/Delsa/sameAs',
    'http://www.w3.org/1999/02/22-rdf-syntax-ns#type',
    'http://www.w3.org/1999/02/22-rdf-syntax-ns#Property'));
Once the semantic operators have been defined, they are used to manage the
concept and relationship hierarchies. The code in the following example links the
concept Course with Subject using the sameAs operator to represent synonyms.
Managing Hierarchies
INSERT INTO global_ontology_rdf_data VALUES (
  s_global_rdf_data_id.NEXTVAL,
  SDO_RDF_TRIPLE_S('global_ontology',
    'http://www.niit.edu.pk/Research/Delsa/Course',
    'http://www.niit.edu.pk/Research/Delsa/sameAs',
    'http://www.niit.edu.pk/Research/Delsa/Subject'));
5.2.5. Creating Rules, Rule-base and Rule Index: To create a user-defined
rule-base, the CREATE_RULEBASE() procedure of the SDO_RDF_INFERENCE package is used. The
following example creates a rule-base named global_ontology_rb for the global ontology.

Creating the Global Ontology Rule-base

BEGIN
  SDO_RDF_INFERENCE.CREATE_RULEBASE('global_ontology_rb');
END;
After creating the rule-base, rules can be added to it. To cause the rules in the
rule-base to be applied in a query of RDF data, one can specify the rule-base in the
call to the SDO_RDF_MATCH table function. Inverse and transitive rules have
been inserted for each semantic operator. The following example explains the
implementation of these rules for sameAs operator.
Inverse Rule for the sameAs Operator

INSERT INTO mdsys.rdfr_global_ontology_rb VALUES (
  'InverseOfSameAs',
  '(?x :sameAs ?y)', NULL, '(?y :sameAs ?x)',
  SDO_RDF_ALIASES(SDO_RDF_ALIAS('', 'http://www.niit.edu.pk/Research/Delsa/')));

Transitive Rule for the sameAs Operator

INSERT INTO mdsys.rdfr_global_ontology_rb VALUES (
  'TransitiveOfSameAs',
  '(?x :sameAs ?y) (?y :sameAs ?z)', NULL, '(?x :sameAs ?z)',
  SDO_RDF_ALIASES(SDO_RDF_ALIAS('', 'http://www.niit.edu.pk/Research/Delsa/')));
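Together, the inverse and transitive rules make sameAs a symmetric, transitive relation; the entailed pairs can be modelled as a closure computation. A Python sketch follows; the pair ("Subject", "Module") is a hypothetical example.

```python
# Symmetric, transitive closure of sameAs, modelling the combined effect
# of the inverse and transitive rules above.

def same_as_closure(pairs):
    closure = set(pairs)
    closure |= {(y, x) for x, y in closure}   # inverse rule
    changed = True
    while changed:                            # transitive rule, to fixpoint
        changed = False
        new = {(x, z)
               for x, y1 in closure
               for y2, z in closure
               if y1 == y2 and x != z}
        if not new <= closure:
            closure |= new
            changed = True
    return closure

# ("Subject", "Module") is hypothetical, added to show transitivity.
pairs = {("Course", "Subject"), ("Subject", "Module")}
closure = same_as_closure(pairs)
```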
Whenever rules are inserted, updated, or deleted in the rule-base, the rules index must
be refreshed. The following example creates the rule index for the global ontology rule-
base.
Rules Index Creation

BEGIN
  SDO_RDF_INFERENCE.CREATE_RULES_INDEX(
    'rdfs_rix_global_ontology',
    SDO_RDF_Models('global_ontology'),
    SDO_RDF_Rulebases('RDFS', 'global_ontology_rb'));
END;
5.3 Implementation of the Proposed Architecture for Relevance Reasoning
Figure 20 shows the package diagram of the proposed architecture for relevance
reasoning in a scalable data integration system. The remainder of this section discusses
the functionality provided by each of these packages, along with a brief description.
5.3.1. PACKAGE Source_Registration_Service: This package manages local
ontologies for the incoming data sources. It provides two procedures for this purpose.

5.3.1.1. REGISTER_SOURCE(): This procedure accepts the name and contents of an
incoming data source and creates the local ontology for it in the source description storage.

Parameter Name     Data type       Description
p_incoming_source  VARCHAR2        Name of the incoming data source. This name must be unique.
p_list_of_triples  TRIPLE_TAB_TYP  List of triples expressing the contents and capabilities of the incoming data source.
Figure 20 Package Diagram of the Proposed Architecture for Relevance Reasoning
5.3.1.2. UNREGISTER_SOURCE(): This procedure accepts the name of a data source
and deletes its local ontology from the source description storage.

Parameter Name     Data type  Description
p_deleting_source  VARCHAR2   Name of the data source to be deleted. This name must be unique.
5.3.2. PACKAGE Ontology_Management_Service: This package manages the global
ontology. It provides three main procedures to perform various tasks.

5.3.2.1. REGISTER_GLOBAL_TRIPLE(): This procedure helps in publishing domain
knowledge in terms of RDF triples. It assigns a GUID to the incoming triple, reserves its
position on the bitmap index, and adds it to the global ontology.

Parameter Name          Data type       Description
p_incoming_triple       SDO_RDF_TRIPLE  RDF triple describing the domain knowledge.
p_incoming_triple_type  VARCHAR2        Type of the RDF triple.
5.3.2.2. RECONCILE_GUID(): This function returns the GUID for the specified RDF triple.
It interacts with the ontology reasoning service to semantically expand the RDF triple and
identify its GUID.

Parameter Name     Data type       Description
p_incoming_triple  SDO_RDF_TRIPLE  RDF triple for which the GUID has to be identified.
5.3.2.3. IDENTIFY_BITMAP_POSITION(): This function accepts a GUID and returns the
bitmap position of the corresponding RDF triple.

Parameter Name          Data type  Description
p_incoming_triple_GUID  NUMBER     GUID of the RDF triple for which the bitmap position has to be identified.
5.3.3. PACKAGE Index_Management_Service: This package manages the bitmap
index in the proposed architecture. Following are its three main procedures.
5.3.3.1. MANAGE_BITMAP_PATTERN(): This procedure manages the bitmap pattern for
the index whenever domain knowledge is published in terms of RDF triples.

Parameter Name          Data type  Description
p_incoming_triple_GUID  NUMBER     GUID of the RDF triple to be published in the global ontology.
5.3.3.2. CONSTRUCT_BITMAP_SEGMENT(): This procedure constructs the bitmap
segment for an incoming data source. It assigns a unique identifier to each bitmap
segment; initially, all bits in the bitmap pattern are set to 0.

Parameter Name     Data type  Description
p_incoming_source  VARCHAR2   URI of the incoming data source for which the bitmap segment has to be created.
5.3.3.3. SYNCH_BITMAP_SEGMENT(): This procedure synchronizes the local
ontology RDF triples with the bitmap segment of a specified data source. It shuffles the
bits according to the RDF triples of the local ontology.

Parameter Name    Data type  Description
p_source_segment  VARCHAR2   Unique identifier assigned to the bitmap segment of the data source.
GUID_POS          NUMBER     Position of the bit on the bitmap segment that needs to be shuffled.
BIT_STATE         VARCHAR2   SET means 1, and UNSET means 0.
5.3.4. PACKAGE Index_Lookup_Service: This package traverses the bitmap
segments in the index for a specified RDF triple. It contains the single function shown below.

5.3.4.1. TRAVERSE_BITMAP_SEGMENT(): This function accepts a position
and traverses the bitmap index at that position to identify the bitmap
segments where the bits are set.

Parameter Name  Data type  Description
GUID_POS        NUMBER     Position of the bit on the bitmap segments that needs to be traversed.
5.3.5. PACKAGE Ontology_Reasoning_Service: This package enables the architecture
to perform ontological inferencing and to calculate the semantic similarity among different
terms. It contains the following functions.

5.3.5.1. GENERATE_SEMANTIC_QUERY(): This function provides the simple semantic
searching behaviour of the proposed architecture. It accepts a term (concept or
relationship) and formulates a semantic query that checks for synonyms, lexical variants,
and subclass operators in their respective hierarchies over the global ontology.

Parameter Name   Data type  Description
P_incoming_term  VARCHAR2   Term for which the simple semantic query has to be generated.

5.3.5.2. GENERATE_SEMANTIC_QUERY_DOL(): This function extends the simple
semantic searching behaviour of the proposed architecture. It formulates a semantic query
that, in addition to synonyms, lexical variants, and subclass operators, checks for terms
that are relevant with some degree of likelihood.

Parameter Name   Data type  Description
P_incoming_term  VARCHAR2   Term for which the extended semantic query has to be generated.
5.3.5.3. FETCH_RELEVANT_TERMS(): This function executes the query generated by the
GENERATE_SEMANTIC_QUERY() function and returns a list of relevant terms for
the term being reasoned over.

Parameter Name   Data type  Description
P_incoming_term  VARCHAR2   Term for which semantic similarity has to be computed.
5.3.5.4. FETCH_RELEVANT_TERMS_DOL(): This function executes the query generated
by the GENERATE_SEMANTIC_QUERY_DOL() function and returns a list of relevant
terms for the term being reasoned over.
5.3.6. PACKAGE Relevance_Reasoning_Service: This package accepts the RDF
triples of a user query and identifies the most effective and relevant data sources.
5.3.6.1. IDENTIFY_RELEVANT_SOURCES(): This function interacts with the ontology
reasoning service, drawing inferences from it to expand the query triples. It also interacts
with the index lookup service to identify the most effective and relevant data sources for
the inferred RDF triples.
Parameter Name Data type Description
p_incoming_subject VARCHAR2 Subject of the query RDF triples
p_incoming_property VARCHAR2 Property of the query RDF triples
p_incoming_object VARCHAR2 Object of the query RDF triples
5.3.6.2. IDENTIFY_RELEVANT_SOURCES_DOL(): This function interacts
with the ontology reasoning service, drawing inferences based on degree of likelihood
to expand the query triples. It also interacts with the index lookup service to
identify the most effective data sources that are relevant with a certain degree of
likelihood.
5.3.6.3. RANK_RELEVANT_SOURCE(): This function ranks the selected data sources
based on the score obtained for the user's query.
Parameter Name Data type Description
p_incoming_source VARCHAR2 Relevant data source that are to be ranked
p_ranking_order VARCHAR2 DESC/ASC means descending/ascending
We have highlighted the Oracle implementation of the ontologies and RDF data, and
discussed in detail the design and implementation of the proposed architecture along with
the issues involved.
CHAPTER 6

RESULTS AND EVALUATION
In this chapter we evaluate the results of the prototype system developed in
Chapter 5. We identify the main evaluation criteria, the details of the data set, the query
structure, the system specification, and the results of the experiments carried out with the system.
6.1 System Specification
Processor          Pentium-IV 2.4 GHz
RAM                1 GB
HDD                80 GB
Operating System   Windows 2003 (with Service Pack 2)
Tool               Oracle Spatial 10g Release 2 NDM
Language           PL/SQL
6.2 Evaluation Criteria

The main aim of this evaluation is to validate whether the proposed architecture for
relevance reasoning can scale to a large number of data sources and complex
queries. To quantitatively measure the performance of the relevance reasoning,
different evaluation measures have been used, which are discussed in the subsequent
section. The evaluation criteria for our system are listed below:
6.2.1. Response Time of Query Execution: to ensure that the manipulation of RDF
triples does not degrade query response time during relevance reasoning as the number of
sources in the system increases.
6.2.2. Accuracy of the Relevant Source Selection: to ensure that the provision of
semantics does not affect the accuracy of the proposed methodology; this can be checked
by calculating the precision and recall of the system for relevance reasoning. Precision
can be defined as the ratio of the relevant data sources retrieved to the total number of
retrieved data sources [41]:

Precision = |{relevant sources} ∩ {retrieved sources}| / |{retrieved sources}|

whereas recall can be defined as the proportion of the relevant data sources that are
retrieved [41]:

Recall = |{relevant sources} ∩ {retrieved sources}| / |{relevant sources}|
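These two measures can be computed with a small Python sketch (illustrative only; the numbers below are hypothetical and not taken from our experiments):

```python
def precision_recall(relevant, retrieved):
    """Precision = |relevant & retrieved| / |retrieved|,
    Recall    = |relevant & retrieved| / |relevant|."""
    relevant, retrieved = set(relevant), set(retrieved)
    hits = len(relevant & retrieved)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Example: 4 of the 5 retrieved sources are relevant; 8 sources are relevant in total
p, r = precision_recall(relevant={1, 2, 3, 4, 5, 6, 7, 8}, retrieved={1, 2, 3, 4, 99})
print(p, r)  # 0.8 0.5
```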
6.3. Data Specification
The experiment has been carried out with a corpus of 100 manually generated data
sources. Each data source contains 30-50 RDF triples. The well-known university
ontology has been used as the domain ontology in the experiment [1, 42].
6.4. Test Queries
We have executed 35 different queries related to students, faculty, and research
associates. We performed the accuracy test of the proposed architecture over these test
queries and comparatively analyzed our system against the MiniCon algorithm [1],
observing the precision and recall of both systems. Among these 35 queries, we selected
3 queries, having 3, 6, and 9 RDF triples respectively, to test the system efficiency by
checking the query response time. These queries are given below:
Find the names of all instructors who are teaching a course to the same student to
whom they are advisors.

RDF Pattern of Query 1
(?instructor :isTeaching :Course) (?student :isRegisteredIn :Course) (?instructor :isAdvisorOf ?student)
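Evaluating such a triple pattern over a set of RDF triples amounts to a join on the shared variables. A minimal Python sketch (the triples are hypothetical, and for illustration the course is treated as a shared variable rather than a constant):

```python
# Hypothetical triple store for the university domain
triples = [
    ("ali",  "isTeaching",     "CS501"),
    ("sara", "isRegisteredIn", "CS501"),
    ("ali",  "isAdvisorOf",    "sara"),
    ("omar", "isTeaching",     "CS502"),
]

def match(pattern, triple, bindings):
    """Try to unify one pattern with one triple under existing bindings."""
    b = dict(bindings)
    for p, t in zip(pattern, triple):
        if p.startswith("?"):            # variable term: bind or check
            if b.get(p, t) != t:
                return None
            b[p] = t
        elif p != t:                     # constant term must match exactly
            return None
    return b

def evaluate(patterns, triples):
    """Join the patterns left to right, as in the RDF pattern of Query 1."""
    results = [{}]
    for pat in patterns:
        results = [b2 for b in results for t in triples
                   if (b2 := match(pat, t, b)) is not None]
    return results

query1 = [("?instructor", "isTeaching", "?course"),
          ("?student", "isRegisteredIn", "?course"),
          ("?instructor", "isAdvisorOf", "?student")]
print(evaluate(query1, triples))
```

Only the instructor who both teaches the course and advises the registered student survives all three joins.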
Find the instructor name, instructor gender, and area of specialization of all
instructors, whether they are staff or students.
Find the instructor name, instructor gender, and area of specialization of all
instructors, whether they are staff or students, where the student's major department is
not the same as the advisor's working department.
6.5. Experiments for Response Time of Query Execution
In the experiment for evaluating the performance, we assessed the query response
time of the system along three dimensions. First, queries were executed against the local
ontologies of the data sources in the source description storage, and we assessed the time
taken by the relevance reasoner to traverse the local ontologies for relevant source
selection. Second, as our proposed methodology employs a bitmap index in which source
descriptions are semantically mapped as bits in the bitmap segments, we submitted the
queries to the relevance reasoner using the bitmap index and assessed the time taken.
Finally, we extended the bitmap index, implemented function-based indexing over it,
and then analyzed the performance of the system. Figures 21, 22, and 23 illustrate the
performance of the system with the 3 queries shown in the preceding section.

RDF Pattern of Query 2
(?instructor :hasName ?name) (?instructor :hasGender ?gender) (?instructor :hasArea ?area) UNION (?student :isAssisting :Course) (?student :hasGender ?gender) (?student :hasMajor ?depart)

RDF Pattern of Query 3
((?instructor :hasName ?name) (?instructor :hasGender ?gender) (?instructor :hasArea ?area) UNION (?student :isAssisting ?Course) (?student :hasGender ?gender) (?student :hasMajor ?depart)) MINUS (?instructor :isAdvisorOf ?student) (?student :hasMajor ?depart) (?instructor :hasWorkingDepart ?depart)
Figure 21 Time Complexity of System (Query with 3 Triples)
Figure 22 Time Complexity of System (Query with 6 Triples)
Figure 23 Time Complexity of System (Query with 9 Triples)
The observations showed a performance gain when running queries on source
descriptions through the bitmap index compared with running them directly against the
source descriptions. A further, significant performance gain was observed when
searching relevant sources using the extended bitmap index, compared with both
previously discussed approaches. Figure 24 shows the performance gain of the extended
bitmap index over the simple bitmap index.
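The bitmap lookup behind this gain can be sketched in Python; the predicate vocabulary, source names, and bit assignments below are hypothetical placeholders, while the actual prototype uses Oracle bitmap indexes over the source description storage:

```python
# Hypothetical vocabulary of predicates; each source description is
# mapped into the bitmap as one bit per predicate it can answer.
patterns = ["isTeaching", "isRegisteredIn", "isAdvisorOf", "hasMajor"]
bit = {p: 1 << i for i, p in enumerate(patterns)}

sources = {
    "src_faculty":  bit["isTeaching"] | bit["isAdvisorOf"],
    "src_students": bit["isRegisteredIn"] | bit["hasMajor"],
    "src_registry": bit["isTeaching"] | bit["isRegisteredIn"] | bit["isAdvisorOf"],
}

def relevant_sources(query_predicates):
    """A source is relevant when its bitmap covers every query predicate."""
    mask = 0
    for p in query_predicates:
        mask |= bit[p]          # build the query's bit mask
    return [s for s, bits in sources.items() if bits & mask == mask]

print(relevant_sources(["isTeaching", "isAdvisorOf"]))
```

The selection reduces to one bitwise AND per source, which is why it is faster than traversing each local ontology at query time.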
Figure 24 Performance gain of the system with respect to direct ontology traversal
6.6. Experiments for System Accuracy
In the experiment for evaluating the accuracy of the system, we calculated the
precision and recall of our proposed methodology and compared them with the
MiniCon algorithm [1]. As the MiniCon algorithm directly traverses the source
descriptions, we did not implement it as-is; rather, we developed code following the same
approach to traverse the local ontologies. Because our proposed semantic matching
process also searches for synonyms, lexical variants, subclasses, and the degree of
likelihood, the comparison showed an increase in both precision and recall with respect
to the MiniCon algorithm.
Figure 25: Precision vs. Recall comparison of the proposed methodology with the MiniCon algorithm
We have provided an evaluation of the results of the developed prototype system in
this chapter. Different evaluation criteria were identified for the system evaluation, and
we compared the results of the prototype system with those of existing systems. The
comparison showed that the system has a better query response time and accuracy of
source selection than the existing systems.
CHAPTER 7
CONCLUSION AND FUTURE DIRECTIONS
In this chapter we conclude the research thesis. It provides an analysis of the results
and the future directions in which the thesis work can be extended. The chapter is of vital
importance because it provides a bird's-eye view of the methodology and gives future
directions for new researchers.
7.1. Discussion
The exponential growth in online data sources due to advancements in information
and communication technologies (ICT) requires semantically enabled, robust, and
scalable data integration. Keeping these objectives in view, we have proposed an
ontology-driven relevance reasoning architecture that identifies the most effective and
relevant data sources for a user's query before executing it. In our proposed
methodology, we mapped the local ontologies of the data sources onto a bitmap index;
instead of traversing the local ontologies during relevance reasoning, we use the bitmap
index to perform the relevance reasoning.
The proposed methodology has three workflows: (1) the Ontology Management
Workflow, (2) the Source Registration Workflow, and (3) the Query Execution
Workflow. This division helps to understand the functionality of the various components
in the methodology along with their inter-dependence. The ontology management
workflow and the source registration workflow set the stage for relevance reasoning in
the proposed architecture.
The ontology management workflow publishes the domain knowledge in the form of
RDF in the global ontology. It creates the concept and relationship hierarchies using the
semantic operators. It also creates the rule-base to define rules and manages the rules
index to perform inference and reasoning during the semantic matching process. The
source registration workflow manages the local ontologies of the data sources in the
source description storage. As new sources enter and leave the system, the index
management service synchronizes the bitmap index to reflect the new status of the source
description storage. In order to answer queries precisely, the bitmap index needs to be
kept synchronized with the source description storage.
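This synchronization can be illustrated with a toy Python sketch; the class, predicate names, and sources are hypothetical, and the prototype itself performs the equivalent maintenance on Oracle bitmap indexes:

```python
class BitmapIndex:
    """Toy index-management service: keeps the bitmap in step with the
    source description storage as sources enter and leave the system."""

    def __init__(self, predicates):
        self.bit = {p: 1 << i for i, p in enumerate(predicates)}
        self.rows = {}                  # source name -> bitmap row

    def register(self, source, predicates):
        mask = 0
        for p in predicates:
            mask |= self.bit[p]
        self.rows[source] = mask        # synchronize on registration

    def unregister(self, source):
        self.rows.pop(source, None)     # synchronize on departure

idx = BitmapIndex(["isTeaching", "isAdvisorOf", "hasMajor"])
idx.register("src_faculty", ["isTeaching", "isAdvisorOf"])
idx.register("src_students", ["hasMajor"])
idx.unregister("src_students")
print(sorted(idx.rows))                 # only the registered source remains
```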
The query execution workflow takes the user's query, formulated in RDF triples, and
identifies the most effective and relevant data sources for the given query. During
relevance reasoning, queries are expanded using the inferences drawn from the ontology
reasoning service. The workflow calculates the semantic similarity between the query
and source RDF triples and identifies the relevant and effective data sources. Relevant
data sources are ranked based on the similarity score they obtained for the user query,
and the sorted list of relevant and effective data sources is returned to the query rewriting
component, which reformulates the queries for these relevant data sources.
7.2. Contributions of the Project
The first contribution of the proposed methodology is that it provides support for
semantic interoperability during the process of relevance reasoning. Semantic operators
are introduced to resolve fine-grained heterogeneities among the contents of different
data sources. The matching process checks for exact matches, lexical variants, synonyms,
subclasses, and the degree of likelihood during semantic matching. The ontology,
rule-bases, and rules indexes have been used for semantic matching and inference during
the relevance reasoning. The accuracy tests of the system showed improved precision
and recall compared with the MiniCon algorithm [1].
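The order of checks described above can be sketched in Python; the degree-of-likelihood values, the toy thesaurus, and the toy concept hierarchy are hypothetical placeholders, not the values used in the prototype:

```python
# Hypothetical degrees of likelihood for each kind of semantic match,
# checked in the order used by the matching process described above.
DEGREE = {"exact": 1.0, "lexical": 0.9, "synonym": 0.8, "subclass": 0.6}

synonyms = {"teacher": "instructor"}        # toy thesaurus
subclass = {"professor": "instructor"}      # toy concept hierarchy

def match_degree(query_term, source_term):
    """Return the degree of likelihood that two terms match semantically."""
    q, s = query_term.lower(), source_term.lower()
    if q == s:
        return DEGREE["exact"]
    if q.rstrip("s") == s.rstrip("s"):      # crude lexical-variant check
        return DEGREE["lexical"]
    if synonyms.get(q) == s or synonyms.get(s) == q:
        return DEGREE["synonym"]
    if subclass.get(q) == s or subclass.get(s) == q:
        return DEGREE["subclass"]
    return 0.0

print(match_degree("Instructor", "instructor"))   # exact -> 1.0
print(match_degree("teacher", "instructor"))      # synonym -> 0.8
print(match_degree("professor", "instructor"))    # subclass -> 0.6
```

Falling through the checks from exact match down to subclass is what lets a source remain relevant, with a lower degree of likelihood, even when its vocabulary differs from the query's.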
The second contribution of the proposed methodology is the provision for
optimization during relevance reasoning with the help of a bitmap index. Previously, the
community used the bitmap index for bulk data management in relational data
warehouses; we instead used the bitmap index to represent RDF models. The bitmap
index is used during relevance reasoning and improves the whole process by traversing
the mapped RDF data more efficiently. The time complexity tests showed that bitmap
indexing performs the relevance reasoning in comparatively less time.
7.3. Future Directions
Currently our focus is on centralized bitmap indexing in data integration systems,
where a single global ontology resides on one node and queries are reformulated over it.
As P2P DBMSs are evolving and data integration is gaining popularity in these domains,
this methodology can in future be extended to meet the requirements of P2P data
integration. Index partitions may reside on each peer, and collectively all peers will
participate in relevance reasoning during query processing.
REFERENCES
[1] Alon Halevy, Anand Rajaraman, Joann Ordille. Data Integration: The Teenage Years. Proceedings of the 32nd International Conference on VLDB, pages 9-16, September 2006.
[2] Yaser A. Bishr. Overcoming the semantic and other barriers to GIS interoperability. International Journal of Geographical Information Science, 12(4):229-314, 1998.
[3] Thomas R. Gruber and Gregory R. Olsen. An Ontology for Engineering Mathematics. Proceeding of 4th International Conference on Principles of Knowledge Representation and Reasoning (KR 1994), pages 258-269, 1994.
[4] Tom R. Gruber. A Translation Approach to Portable Ontology Specifications. Knowledge Acquisition, pages 199-220, 1993.
[5] Natalya F. Noy. Semantic Integration: A Survey of Ontology-Based Approaches. SIGMOD Record, Vol. 33, pages 65-70, December 2004.
[6] Isabel F. Cruz and H. Xiao. The Role of Ontologies in Data Integration. Journal of Engineering Intelligent Systems: pages 245-252, December, 2005.
[7] M. Jamadhvaja, Twittie Senivgee. An Integration of Data sources with UML Class Models Based on Ontological Analysis. Pages 1-8, November 4, 2005, ACM, Bremen, Germany.
[8] S. Khan and F. Marvon, Identifying Relevant Sources in Query Reformulation. In the proceedings of the 8th International Conference on Information Integration and Web-based Applications & Services (iiWAS2006), Yogyakarta, Indonesia, December 2006.
[9] Wache, H., Vogele, T., et al., Ontology-Based Integration of Information — A Survey of Existing Approaches in The Seventeenth International Joint Conference on Artificial Intelligence, Seattle, Washington, USA, 2001.
[10] Arens, Y., Hsu, C.N., et al. Query processing in the SIMS information mediator. In readings in agents, Morgan Kaufmann Publishers Inc., pages 82-90, 1997, San Francisco USA.
[11] Mena, E., Illarramendi, A. OBSERVER: An approach for query processing in Global Information Systems based on Interoperation across Pre-existing Ontologies. IEEE, pages 19-21, 1996.
[12] F. Naumann, U.Leser, and J.C. Freytag. Quality-driven integration of heterogeneous information systems. 25th Proceeding of International Conference on VLDB, pages 447-458, Scotland, September 1999.
[13] Isabel F. Cruz, Huiyong Xiao, and Feihong Hsu. An Ontology-based Framework for Semantic Interoperability between XML Sources. In Proceedings of the 8th International Database Engineering and Applications Symposium (IDEAS), pages 217-226, July, 2004. IEEE Computer Society 2004.
[14] Nicola Guarino. Formal Ontology and Information Systems. In Proceedings of the 1st International Conference on Formal Ontologies in Information Systems (FOIS 1998), pages 3-15, 1998.
[15] Alon Y. Halevy. Answering queries using views: A survey. The VLDB Journal, pages 270-294, 2001.
[16] Alon Y. Halevy, Anand Rajaraman, Joann J. Ordille. Querying heterogeneous information sources using source descriptions. In the proceedings of the International Conference on Very Large Databases (VLDB), 1996.
[17] Rachel Pottinger and Alon Halevy. MiniCon: A scalable algorithm for answering queries using views. VLDB Journal, 2001.
[18] G. Wiederhold. Mediators in the architectures of future information systems. IEEE Computer, Pages 38-49, March 1992.
[19] J. Zhong, H. Zhu, et al. Conceptual graph matching for semantic search. In the proceedings of the 10th International conference on Conceptual Structures (ICCS), LNCS 2393, pages 92-106, Bulgaria, July 2002. Springer.
[20] A.H. Levy: Why Your Data Won’t Mix: Semantic Heterogeneity. ACM Queue 3, pages 50-58, 2005.
[21] RDF Primer. W3C Recommendation, 10th February 2004, http://www.w3c.org/RDF/
[22] Waris Ali, Sharifullah Khan, Global Query Generation over Diverse Data Sources Using Ontology. In 1st International Conference on Information and Communication Technologies, 9th June 2007, Bannu, N.W.F.P, Pakistan.
[23] Nicole Alexander, Siva Ravada. RDF Object Type and Reification in the Database. In the proceeding of 22nd Int. Conference on Data Engineering (ICDE’06). IEEE Computer Society 2006.
[24] R. Smith, T. Connolly, Data Integration Service, Book Chapter, Information management in Large Scale Enterprises. 3rd Edition.
[25] Mediator-Wrapper, http://www.objs.com/survey/wrap.htm
[26] S. Khan, F. Movan, Scalable Integration of Biomedical Sources, In the proceedings of the 8th International Conference on Information Integration and Web-based Applications & Services (iiWAS2006), Yogyakarta, Indonesia, December 2006.
[27] Jacob Kohler, Stephan Philippi, Michael Specht, Alexander Rueggd, Ontology based text indexing and querying for the semantic web? Knowledge-Based Systems 19 (2006), pages 744-754.
[28] X. Li, F. Bian, H. Zhang, C. Diot, R. Govindan, G. Iannaccone. "MIND: A Distributed Multi-Dimensional Indexing System for Network Monitoring". IEEE Infocom 2006 Barcelona April 06.
[29] XML Vocabulary Description Language 1.1 XML Schema, W3C Recommendation May 2001, http://www.w3.org/XML/Schema
[30] The DARPA Agent Markup Language Home Page. August 2000, http://www.daml.org/
[31] Web Ontology Language, W3C Recommendation, 06 September 2007. http://www.w3.org/2004/OWL/
[32] B-Tree and Bitmap Indexing. Oracle Developer Guide 10g Release 2, Part no: A969505-01, Oracle Corporation, March 2002.
[33] Jena – A semantic web framework for Java, http://jena.sourceforge.net/
[34] Kowari meta store for OWL and RDF metadata, http://www.kowari.org/
[35] Jose Kahan, Marja-Riitta, Eric Prud’Hommeaux, Ralph R. Swick. Annotate: An Open RDF Infrastructure for Shared Web Annotations, Proceedings of the WWW 10th Int. Conf., Hong Kong, May 2001.
[36] A web-based RDF browser, Longwell, http://simile.mit.edu/wiki/Longwell
[37] Oracle Semantic Technologies Network, Spatial Technology using Network Data Model, http://www.oracle.com/technology/tech/semantic_technlogies/index.html.
[38] P. Mitra. Algorithms for Answering Queries Efficiently Using Views. Technical report, Infolab, Stanford University, September 1999.
[39] F. N. Afrati, C. Li, and J. D. Ullman. Generating Efficient Plans for Queries Using Views. In ACM SIGMOD International Conference on Management of Data, Santa Barbara, CA, May 2001.
[40] E. I. Chong, S. Das, G. Edon, J. Srinivasan. An Efficient SQL-based RDF Querying Scheme. Proceedings of the 31st VLDB Conference, Trondheim, Norway, 2005.
[41] Giannis Varelas, Epimenidis Voutsakis, Paraskevi Raftopoulou, “Semantic Similarity Methods in WordNet and their Application to Information Retrieval on the Web”, 7th ACM international workshop on Web information and data management November 5, 2005.