
A Lightweight Data Integration

Architecture

David William Williamson

a thesis submitted for the degree of

Master of Science

at the University of Otago, Dunedin,

New Zealand.

12 June 2006

Abstract

Content syndication specifications such as Atom have become a popular

mechanism to disseminate information across the Internet, with many sites

providing Atom feeds for users to subscribe to and consume. Such a scenario

typifies the originally intended use of Atom; however, our research has ex-

plored an alternative domain for this syndication technology. This research

has evaluated Atom for its potential as a lightweight platform to support

data integration from a set of data sources to a single target database.

The implementation of the Atom-based architecture that we developed for

this research combines freely available server-side scripting technology with

the simplified asynchronous connection scheme that content syndication

technology offers. We use several use cases each with different degrees of

complexity, yet sharing common requirements, as a guide in the develop-

ment of our prototype.

In order to evaluate our Atom-based architecture, our experimental design

required the construction of an evaluation framework that measured the

prototype’s impact upon the network and computation resources it con-

sumed. These measurements were compared with observations of response

time requirements between operational and analytical processing systems.

The experiments carried out to evaluate the Atom-based data integration

architecture have shown that the architecture has potential in facilitating

a lightweight data integration solution. Our research has shown that an

Atom-based architecture is capable of operating within a range of condi-

tions and environments, and with further development, would be capable

of greater processing efficiency and wider compatibility with other types of

data structures.


Acknowledgements

I would like to take this opportunity to express my gratitude to all those people

who have supported me during this research, and in particular:

My supervisor Dr. Nigel Stanger, who suggested the idea that would be-

come the core of my work, and who has been invaluable in every step of

the project.

To Dr. Noria Foukia, Dr. Colin Aldridge and Dr. Tony Moore, for giving

their own time to proof-read and critique various parts of my thesis.

Graham & Co. of the Technical Services Group, thanks guys for providing

support and resources that enabled me to complete my experiments.

My colleagues and office-mates past and present: Prajesh, Christian, Heiko,

Dr. Xin, Matt, (soon to be Dr.) Grant, Ahmad and Jacqui, thanks for some

hilarious moments, lively debate on anything and for creating a fantastic

working environment. It has been a privilege to work alongside you all and

it is my hope that we can remain friends for a long time to come.

To my family, though you weren’t always sure what on earth it was I was

doing, you supported me nonetheless, thank you.

Lastly but certainly not least, my dear Sabine, you have been a pillar

of strength to me throughout this project, although we live in an age of

sophisticated communications technology, the other side of the world is still

a great distance. Completing this thesis has meant spending considerable

time apart; I sincerely thank you for your support.


Contents

1 Introduction
1.1 Purpose of Study
1.2 Research Scope
1.2.1 Objectives
1.2.2 Delimitations
1.3 Structure of Thesis
1.4 Summary

2 Related Work
2.1 Introduction
2.2 The Semantic Web
2.2.1 Semantic Web Principles
2.2.2 Resource Description Framework (RDF)
2.2.3 PRISM
2.2.4 RSS
2.2.5 Atom
2.2.6 Recent Developments
2.3 Data Integration
2.4 Publish/Subscribe
2.5 Update Propagation
2.6 Data Streaming
2.7 Summary

3 System Design
3.1 Introduction
3.2 Use Cases
3.2.1 Movie Timetable e-Catalogue
3.2.2 MP3 Music Retail System
3.2.3 Electronics Retailer Data Warehouse
3.3 Requirements Summary
3.4 Development
3.4.1 The Development Environment
3.4.2 Implementation Rationale
3.4.3 The Feed Builder Module
3.4.4 The Feed Consumer Module
3.4.5 System Configuration
3.5 Discussion
3.6 Summary

4 Experimental Design
4.1 Introduction
4.2 Experiment Rationale
4.3 Evaluation Methodology
4.3.1 The Load Test
4.3.2 The Operational Test
4.3.3 The Latency Test
4.4 Summary

5 Results
5.1 Introduction
5.2 Findings
5.2.1 The Load Test
5.2.2 The Operational Test
5.2.3 The Latency Test
5.3 Summary

6 Conclusion
6.1 Discussion of Results
6.2 Summary
6.3 Recommendations and Conclusions

References

A Music Kiosk Use Case
A.1 MP3 Kiosk Project Documentation
A.2 Schemas
A.2.1 Kiosk Schema
A.2.2 Source Schema

B Experiment Data Sets
B.1 Load Test
B.2 Operational Test
B.3 Latency Test Data

C Example RSS Feed

List of Figures

2.1 A simple, single entry Atom feed document (Nottingham and Sayre, 2005).
2.2 Prototypical Architecture of a Data Integration System (Levy, 2000).
2.3 Wiederhold’s (1993) three tier integration architecture “I3” utilising mediators.
2.4 Inputs and Outputs of Data Integration (Batini, Lenzerini and Navathe, 1986).
2.5 Interfacing Strategies (Pascoe and Penny, 1990).
2.6 A data integration framework using publish/subscribe (Vargas, Bacon and Moody, 2005).
2.7 Golab and Ozsu’s (2003a) DSMS architecture.
2.8 Apama financial analysis system (Progress Software, 2006).
3.1 Overview of the basic architecture
3.2 Flow chart of Atom feed builder
3.3 Example Atom entry from the MP3 kiosk use case.
3.4 Flow chart of Atom feed consumer
5.1 Performance
5.2 Comparison of Outputs
5.3 Network Traffic Generated by Atom Prototype
5.4 Packets Generated by Atom Prototype
5.5 Update latency

Chapter 1

Introduction

1.1 Purpose of Study

Atom is a content syndication specification intended to provide a simple means to read

and write information on the World Wide Web (WWW). The benefit of a specification

like Atom is that it allows users to easily remain up to date with the latest informa-

tion from many web sites, as well as easily publishing their own content for others to

consume (AtomEnabled, 2005).

This research has observed Atom in a context slightly removed from its web-centric

content origins; Atom has been evaluated for its potential as a lightweight platform

to support data integration, by means of asynchronous update propagation, between

relational databases. An asynchronous approach is easily scalable because of its general,

simplified support infrastructure, allowing connections between objects to be decoupled

in terms of synchronization, space and time.

Syndication presents a further simplified asynchronous framework that removes ad-

ditional infrastructure, like that found in publish/subscribe systems, between objects.

However, syndication technology retains the advantages of scalability associated with

asynchronous connection schemes. We have combined this simplified asynchrony with

the low cost, platform independent technology Hypertext Preprocessor (PHP).

The collective advantages of the scalability of an asynchronous approach, the sim-

plified infrastructure afforded by syndication technology such as Atom, and the feature-

rich technology of PHP, give rise to an avenue to create a data integration solution that

is lightweight in terms of the extent of its impact on an organisation’s available resources.


1.2 Research Scope

1.2.1 Objectives

As mentioned previously, the purpose of this research was to evaluate Atom for its

potential to facilitate data integration. In order to achieve this, the objective was

further abstracted and defined into two objectives:

1. To investigate the degree of potential an Atom-based approach has as a data

integration architecture that is lightweight in terms of tangible (network and

computational) resource requirements.

2. To infer from the degree of potential exposed by the prototype whether it is

worthwhile pursuing the use of syndication technology such as Atom in the do-

main of database integration.

In order to achieve these objectives, the following sub-goals were defined:

1. Identify appropriate technologies that are readily available and capable of deliv-

ering the features needed in the prototype Atom-based architecture.

2. Construct an appropriate framework to evaluate and examine the behaviour of

an implementation of Atom-based architecture under various loading conditions,

configurations and use cases.

3. Incorporate suitable use cases into the framework, i.e., identify scenarios within

the scope of usage the Atom-based architecture is intended for.

1.2.2 Delimitations

The scope of the research was narrowed primarily in regards to the schema and data

types used when evaluating the Atom-based architecture:

1. The research did not look at security features such as authentication or encryp-

tion.

2. The structure of data sources used in testing was restricted to relational type

schemas.

3. Data types used in both the data sources and the target were restricted to al-

phanumeric text.


4. The direction or flow of information is one-way; that is, the focus has been on

propagating data from a source to a target and not back the other way.

1.3 Structure of Thesis

The content of this thesis has been organised into six chapters. Related work is covered

initially before details specific to this research are presented. The thesis concludes with

a discussion of the implications of the results obtained from this research.

Chapter 2 presents a series of topics from related work to illustrate and discuss

where both this research and the Atom specification are positioned. Initially precur-

sor technologies to the Atom specification are presented (the Semantic Web) before

the related fields of data integration, update propagation and the publish/subscribe

paradigm are discussed. The final section presents data streaming to illustrate a con-

trasting technology.

Chapter 3 details the implementation of the Atom-based architecture that we de-

veloped for this research. We combined the freely available, feature rich technology

PHP with the simplified asynchronous connection scheme that content syndication

technology offers to create our data integration prototype.

Chapter 4 describes the experimental design, which required the construction of

an evaluation framework that measured the prototype’s impact upon the network and

computation resources it consumed. These measurements were used in conjunction

with observations of response time differences between operational and analytical pro-

cessing systems.

Chapter 5 presents the results of the three different experiments carried out to eval-

uate the prototype implementation of the Atom-based data integration architecture.

The experiments focussed on obtaining data pertaining to the responsiveness (latency)

of the system and its impact on the network and computational resources available in

its immediate environment.

Chapter 6 concludes our work by summarising and discussing the results of our

research and its implications, before suggesting recommendations and directions for

further investigation.


1.4 Summary

The purpose of this research was to evaluate the Atom content syndication specification

for its potential as a platform to facilitate a lightweight data integration architecture. The

term lightweight was defined in terms of impact on network and computing resources.

The extent of potential was used to infer if pursuing an Atom-based architecture beyond

a prototypical status was worthwhile.

The scope of the evaluation restricted the prototype Atom-based architecture to

focus on integrating alphanumeric text data between source and target relational

databases, and has not addressed data security issues. The next chapter will present

topics from research work related to the Atom specification and data integration.


Chapter 2

Related Work

2.1 Introduction

This review of related work has been organised into five main subject areas. The

first, presented in Section 2.2, discusses the Semantic Web, its core design principles

and the technology that is intended to implement it; namely Resource Description

Framework (RDF). This discussion is presented to give some background to the Atom

specification (Section 2.2.5), which has been built on Semantic Web technology and

plays an important part in this research activity.

Section 2.3 introduces the topic of data integration, which is the domain that the

Atom specification has been adapted for in this research. Data integration refers to

the problem of trying to provide a user with a unified view of data that may be stored

in multiple locations and in differing formats. This research activity has attempted to

evaluate Atom for its potential as a lightweight platform to support data integration

by means of asynchronous update propagation from a series of data sources to a single

target database.

Publish/subscribe is then discussed in Section 2.4; this is a paradigm that has

received significant attention recently for its claimed ability to provide a flexible and

highly scalable framework for large distributed information systems. The publish/sub-

scribe architecture uses a common framework that subscribers use to register their

interest in the occurrence of specific events.

The next topic is update propagation introduced in Section 2.5. Update propagation

refers to the problem of updating copies of an object and is commonly associated with

distributed systems. The problem is centred on the need to ensure that if a change

is made to an object (e.g., a row is updated in a table) then that change must be

disseminated to all other copies of that object in the system.


Data streaming is the final topic to be discussed in Section 2.6. A significant area

of research in its own right, it reflects the fact that information systems increasingly have to be able to process

types of data that are highly dynamic and transient.

2.2 The Semantic Web

The World Wide Web (WWW) as it stands today is comprised mostly of documents in-

tended for humans to read, which allows minimal opportunity for computers to perform

additional interpretation or processing on them (Koivunen and Miller, 2001; Berners-

Lee and Fischetti, 1999). In essence, computers in use on the Web today are primar-

ily concerned with the parsing of elementary layout information, for example head-

ers, graphics or text, or user input form processing (Berners-Lee, Hendler and Las-

sila, 2001).

There are few means by which computers can perform more meaningful processing

on web resources (Berners-Lee et al., 2001; Fensel, Hendler, Lieberman and Wahlster,

2003) most often because the additional meaning (semantics) required does not exist

or is not in a form that can be interpreted by them (Koivunen and Miller, 2001).

The motivation for the adoption of semantics can be made evident simply by using

a contemporary search engine to look for an address. This search will likely return

a variety of results ranging from street addresses and email addresses to public addresses

made by important individuals through the ages.

This kind of scenario is one of the reasons for the World Wide Web Consortium’s

(W3C) Semantic Web project (Koivunen and Miller, 2001). In the words of its creator,

Tim Berners-Lee, its goal is to:

“. . . develop enabling standards and technologies designed to help machines

understand more information on the Web so that they can support richer

discovery, data integration, navigation and automation of tasks. With Se-

mantic web we not only receive more exact results when searching for in-

formation, but also know when we can integrate information from different

sources, know what information to compare, and can provide all kinds of au-

tomated services in different domains from future home and digital libraries

to electronic business and health services.” (Koivunen and Miller, 2001, p.

27).

In other words, the Semantic Web will provide a mechanism in which more intelli-

gent searching and processing of information will be possible by further extending the

existing capabilities of the World Wide Web (WWW).


2.2.1 Semantic Web Principles

The W3C have outlined several assumptions that form the basis for how the Seman-

tic Web will operate; firstly that any (physical or abstract) object or concept can in

some way be referred to through the use of Uniform Resource Identifiers (URIs). One

common example of a URI is the Uniform Resource Locator (URL) of a web page.

Closely aligned to this principle is the premise that resources and the links between

them can have types, for example, currently the Web is comprised of hyperlinks and

resources, and often the resources are documents which are oriented more toward hu-

man interaction (i.e. for reading by a user). Documents like those just described often

lack any additional data that machines could use to derive what kind of documents they are

or what their relationships may be with other documents or resources. The Semantic Web

will remedy this situation by providing the capability to append additional metadata

providing computers with a means to perform further automation of tasks.

Like the contemporary WWW, the Semantic Web is unbounded with the possibility

of any number of different types of links between differing resources. Also, like the

WWW, links to resources may change, be used for something else or disappear entirely,

thus the Semantic Web must be able to tolerate volatility of the data that are held

within it.

Trustworthiness of a resource can be scrutinized by the application that intends

to process that resource’s information. Applications evaluate the trustworthiness of a

resource by looking at statements or assertions made about that resource. For example,

who has said what, when it was said and what authority allows such a statement to

be made by that particular entity about that resource.

The descriptive conventions used by the Semantic Web allow the creation of vo-

cabularies that can grow to accommodate the ever-expanding breadth of human un-

derstanding. In addition, a vocabulary used by one party can be combined with other

vocabularies (Berners-Lee, Connolly and Swick, 1999) in order to alleviate ambiguity

or inconsistency between parties.

The final principle outlines the W3C’s intention to standardise only what is deemed

necessary, allowing the Semantic Web to evolve and grow freely.

These principles are implemented through the use of specific web technologies and

standards developed by the W3C. The rest of this section will outline the relationships

of these components to the Semantic Web.


Key Components

The following list is taken from The OWL Web Ontology Language Overview (McGuin-

ness and van Harmelen, 2004), and provides a description of the technologies and

standards that are to be used to implement the Semantic Web:

• XML provides a surface syntax for structured documents, but imposes no seman-

tic constraints on the meaning of these documents.

• RDF is a data model for objects (“resources”) and relations between them, it

provides a simple semantics for this data model, and these data models can be

represented in XML syntax.

• RDF Schema is a vocabulary for describing properties and classes of RDF re-

sources, with a semantics for generalization-hierarchies of such properties and

classes.

• OWL is the Web Ontology Language. Though not directly related to this re-

search, it is an important component of the Semantic Web and is intended to be

used when information contained in documents needs to be processed by applica-

tions rather than having the documents’ content presented in human-consumable

form.

2.2.2 Resource Description Framework (RDF)

RDF is a technology that is an integral part of the W3C Semantic Web initiative,

as the following excerpt from the W3C Semantic Web activity statement, by Powers

(2003), will attest:

“The Resource Description Framework (RDF) is a language designed to

support the Semantic Web, in much the same way that HTML is the lan-

guage that helped initiate the original Web. RDF is a framework for sup-

porting resource description, or metadata (data about data), for the Web.

RDF provides common structure that can be used for interoperable XML

data exchange” (Powers, 2003, p. 1).

What RDF does in the context of the Semantic Web is to provide the capability

of recording data in a way that can be interpreted easily by machines, which in turn


provides an avenue to “. . .more efficient and sophisticated data interchange, searching,

cataloguing, navigation, classification and so on. . . ” (Powers, 2003, p. 14).

The concept forming the basis for RDF model structure is that an entity being

described will have properties, and those properties will have values associated with

them. To formalise this concept, the RDF description statements consist of triples,

namely the subject, the predicate and the object. The subject part holds data about

what sort of entity this description is about (e.g. a document, a person etc.). The

predicate part contains a property of the subject (date created, name etc.) and the

object contains a value for the property (Manola, Miller and McBride, 2004).
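To make the triple structure concrete, the following short sketch is purely illustrative (the resource URI is invented and the two property names are simply common Dublin Core elements, not examples drawn from this thesis); it holds two statements about one resource as subject–predicate–object values and prints them:

<?php
// Illustrative sketch: two RDF statements about one resource, each held as a
// subject-predicate-object triple (hypothetical resource URI and values).
$statements = [
    [
        'subject'   => 'http://example.org/report',               // the resource described
        'predicate' => 'http://purl.org/dc/elements/1.1/creator', // a property of it
        'object'    => 'John Doe',                                // the property's value
    ],
    [
        'subject'   => 'http://example.org/report',
        'predicate' => 'http://purl.org/dc/elements/1.1/date',
        'object'    => '2006-06-01',
    ],
];

foreach ($statements as $s) {
    echo $s['subject'] . ' --' . $s['predicate'] . '--> ' . $s['object'] . "\n";
}

A real RDF document would serialise such triples in the XML syntax noted above; the array form is used here only to show the three-part structure of a statement.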

PRISM and RSS 1.0 are two examples of applications resulting from RDF since

RDF’s inception in the late 1990s.

2.2.3 PRISM

Publishing Requirements for Industry Standard Metadata (PRISM) is a metadata spec-

ification developed in the publishing industry. The specification was intended to help

publishers easily use their content in different ways and therefore improve the return on

the initial investment of creating the content to start with (IDEAlliance, 2006; Manola

et al., 2004).

2.2.4 RSS

RDF Site Summary (RSS) is an XML application, of which versions 0.90, 1.0 and

1.1 conform to the W3C’s RDF specification; it is a format intended for metadata

description and content syndication (Manola et al., 2004). Originally developed by

Netscape as a means to syndicate content from multiple sources onto one page (Powers,

2003), RSS has been embraced by other individuals and organisations resulting in the

creation of multiple versions.

As a consequence, there are now two branches of the RSS standard. Versions 0.90,

1.0, 1.1 constitute the first branch. The second branch contains versions 0.91 - 0.94

and 2.0.1, commonly referred to as RSS 2.0, which is copyrighted by Harvard University and is

considered frozen.

At its most simple, the information provided in an RSS document consists of the

description of a channel (that could be on a specific topic such as current events, sport

or the weather, etc.) consisting of items (e.g. a news headline) linked to other resources

(e.g. the related news article). Each item is comprised of a title, a link to the actual


content and a brief description or abstract. Appendix C contains an example RSS

document taken from the W3C website. Because of the proliferation of differing RSS

standards and associated problems with compatibility, a group of service providers,

tool vendors and independent developers have initiated the development of a separate

syndication standard called Atom.

2.2.5 Atom

The Atom specification is an XML-based document format that has been designed to

describe lists of related information (Nottingham and Sayre, 2005). These lists have

a URL and are accessed via HyperText Transfer Protocol (HTTP), i.e. over the Web,

and are known as feeds. Feeds are made up of multiple items, known as entries; each

entry can have an extensible set of attached metadata (Nottingham and Sayre, 2005).

Figure 2.1 shows an example of a simple, single-entry Atom feed document.

Atom as a technology comprises four key related components: a conceptual model

of a resource, a well defined syntax for this model, the actual Atom feed format and

an editing protocol. Both the feed format and the editing protocol make use of the

aforementioned syntax.

The latest specification of Atom (1.0) is a successor to the initial version (0.3),

which, at the time of writing, was still in draft form, and states that the main purpose

Atom is intended to address is “. . . the syndication of Web content such as weblogs

and news headlines to Web sites as well as directly to user agents” (Nottingham and

Sayre, 2005). The specification also suggests that Atom should not be limited to just

web based content syndication but in fact may be adopted for other uses or content

types. A detailed comparison of the Atom and RSS 2.0 specifications can be accessed

from the official Atom website (http://www.atomenabled.org).

2.2.6 Recent Developments

Recently both Google Inc. and Microsoft Corp. announced the release of APIs and

specifications that are based on content syndication technologies (Atom and RSS re-

spectively) to support the dissemination of data via the WWW.

Microsoft released the draft specification for SSE (Simple Sharing Extensions) ver-

sion 0.9 in November 2005, followed by version 0.91 in January 2006. SSE is a set

of extensions to the RSS 2.0 and Outline Processor Markup Language (OPML) 1.0



<?xml version="1.0" encoding="utf-8"?>

<feed xmlns="http://www.w3.org/2005/Atom">

<title>Example Feed</title>

<link href="http://example.org/"/>

<updated>2003-12-13T18:30:02Z</updated>

<author>

<name>John Doe</name>

</author>

<id>urn:uuid:60a76c80-d399-11d9-b93C-0003939e0af6</id>

<entry>

<title>Atom-Powered Robots Run Amok</title>

<link href="http://example.org/2003/12/13/atom03"/>

<id>urn:uuid:1225c695-cfb8-4ebb-aaaa-80da344efa6a</id>

<updated>2003-12-13T18:30:02Z</updated>

<summary>Some text.</summary>

</entry>

</feed>

Figure 2.1: A simple, single entry Atom feed document (Nottingham

and Sayre, 2005).


specifications (Ozzie, Moromisato and Suthar, 2005). The goal of SSE is to provide

a basic, minimum set of extensions “. . . to support loosely-cooperating apps [applications]” (Ozzie et al., 2005, p. 1).

The proposed model of usage is very simple; the SSE website (Ozzie et al., 2005)

provides an example of a usage model comprising two nodes (the term “endpoints”

rather than “nodes” is used by Microsoft). Both nodes in the scenario wish to share

data with the other; to do this, each node publishes an RSS 2.0 feed containing those

data, along with the SSE mark-up. The SSE data contains information that is used

by the nodes to synchronise each other’s items. The framework to facilitate the

synchronisation is created simply by each node subscribing to the other node’s feed.

Google released their GData protocol in April 2006 with an API that is currently

in a beta stage of development (Google, 2006). Unlike Microsoft’s SSE, the GData

protocol is based upon both Atom 1.0 and RSS 2.0, and also makes use of the Atom

specification’s publication protocol (Google, 2006). The GData protocol also provides

basic querying functionality.

Interestingly, both the GData and SSE documentation use calendar data synchroni-

sation scenarios as example uses of the technologies (Ozzie et al., 2005; Google, 2006).

More importantly however, the example scenarios represent a move away from more

conventional uses of syndication technologies; the calendar scenarios show the use of

Atom and RSS based technology to disseminate data bi-directionally between appli-

cation systems, rather than the more usual unidirectional publication of data as seen

with a news feed or personal web log.

Additionally, the fact that two of arguably the world’s most high profile technology

companies have developed similar enhanced syndication specifications and protocols

further emphasizes the growing focus on the use of content syndication technologies as

a model to disseminate data.

A possible use of such technologies is in the area of data integration, which is

discussed in the next section, and this is the domain to which the Atom specification

has been applied in this research.

2.3 Data Integration

Data integration is a term used to describe the act of combining data from different

sources in order to provide the user with a unified view of those data (Batini et al., 1986;

Yu and Popa, 2004; Lenzerini, 2002). The main advantage of a data integration system


is that it provides the user with a unified interface (Levy, 2000; Friedman, Levy and

Millstein, 1999) to disparate data sets, which in turn allows simpler querying of the data.

In this context, simpler querying means that there is less cognitive workload placed

upon the user. This workload reduction is achieved by the fact that the user no longer

has to deal with the issue of knowing where the data are and how to get them; rather

they can focus on what they actually want to retrieve. This activity is becoming

increasingly important to modern business operations, as more organizations become

reliant on systems to support staff in making important business decisions. These

systems and applications often require the assimilation of diverse sets of data (Yu and

Popa, 2004; Calvanese, Giacomo, Lenzerini, Nardi and Rosati, 1998; Wiederhold, 1995).

The research domain of data integration has been an active topic for some time (Beck,

Weitzal and Konig, 2002; Wiederhold, 1993; Ullman, 1997); today this domain is of

no less significance with many organizations requiring the aggregation of data from

multiple and often heterogeneous sources, for a wide variety of applications (Haas,

Miller, Niswonger, Roth, Schwarz and Wimmers, 1999).

A simple example of data integration at work is a searchable electronic library

catalogue (an example is at http://otago.lconz.ac.nz/). Often such systems will also search through other remote sources such

as other library catalogues or journal article databases. The search results from each

source are then integrated and presented to the user on their computer monitor.

Inmon (1993) also discusses how data integration is a necessity in the functionality

of a data warehouse. Data coming into a data warehouse need to be put through

an integration process to ensure that they can be inserted into the data warehouse,

thus allowing the data to be used for the purposes they are now intended for, such as

decision support.

These two examples also illustrate two distinct design philosophies towards data

integration architecture in terms of a temporal aspect. The library catalogue is pur-

posely built to be a responsive “on-demand” system where the user can perform ad-hoc

searches for book references and retrieve a result in real-time. In the data warehouse

environment, however, various operations and requests (e.g., insertion of new records,

querying the data, etc.) would happen at a much lower frequency, e.g., generating a

monthly sales report. Although both these approaches differ in temporal context, they

are both similar in terms of how they can be implemented; presently a common method

to facilitate data integration is with a “mediated” approach (Wiederhold, 1993; Widom,

1995).



Lenzerini (2002) formally defines a mediated schema-based data integration system

I as: I = (G, S, M) where G = Global Schema, S = Source Schema and M = Mapping

between G and S.

This approach uses a mediator that is placed between the source data and the global

schema. The mediator can help provide a mapping between the source and target

schemas that specifies where and what to extract from the source, and a description

of the rules that need to be followed in order to perform a valid transformation of

the data. Figure 2.2 provides an example of prototypical mediated data integration

architecture; it summarises techniques that have been previously illustrated by other

authors such as the “I3” architecture of Wiederhold (1993) shown in Figure 2.3.

Even earlier, traces of mediator based architecture can be found within work from

Batini et al. (1986), as shown in Figure 2.4. From these examples, a pattern starts to

emerge in regard to the structure of a data integration architecture. Three phases or

layers comprise a generic data integration architecture:

1. The data sources

2. The mediator framework

3. The global schema

The way a mapping between a source and a global schema is specified is very

important as it will determine how the source(s) can be queried and what kind of data

can actually be collected (Lenzerini, 2002). As a consequence, the ability to model a

mapping specification receives significant attention within the data integration research

community.

Two common approaches mentioned in the literature for specifying mappings are

the Global-As-View (GAV) and Local-As-View (LAV) methods. A GAV approach

specifies the global (mediated) schema in direct reference to the data sources (Duschka,

Genesereth and Levy, 2000). Specifically, each item that constitutes the global schema

is associated with a particular view over the data sources (Cali, Calvanese, De Gia-

como and Lenzerini, 2002). For additional examples of GAV see the work of Adali,

Candan, Papakonstantinou and Subrahmanian (1996); Goh, Bressan, Madnick and

Siegel (1999); Tomasic, Raschid and Valduriez (1998); and Chawathe, Garcia-Molina,

Hammer, Ireland, Papakonstantinou, Ullman and Widom (1994).

The LAV approach conversely specifies the relationships between the global schema

and the data sources relative to the global schema itself, i.e. rather than constructing

Figure 2.2: Prototypical Architecture of a Data Integration System (Levy, 2000).

Figure 2.3: Wiederhold’s (1993) three tier integration architecture “I3” utilising mediators.

Figure 2.4: Inputs and Outputs of Data Integration (Batini et al., 1986).

a global schema from the data sources, a global schema is defined first and then the

sources are described as views over the global schema. Cali, Calvanese, Giacomo and

Lenzerini (2002) give a detailed formal description of LAV.

The discussion by Levy (2000) of the comparison between the LAV and GAV ap-

proaches is typical of that found in other works such as that of Lenzerini (2002); Cali,

Calvanese, De Giacomo and Lenzerini (2002); Ullman (1997); and Cali, Calvanese,

Giacomo and Lenzerini (2002). Regarding GAV, the main advantage of this approach

is that query reformulation over the participating sources is very simple (Levy, 2000).

However, there is a disadvantage to this approach in terms of the difficulties of scaling

a system to include additional data sources. This is because for each source a specifi-

cation needs to be built that shows all the possible combinations that the source can

be used in, with reference to all the other relations in the mediated schema. This issue

has been further quantified in the literature; Pascoe and Penny (1990) illustrated prob-

lems associated with various interfacing strategies used for performing data translation.

Their work identified three possible interfacing strategies which resemble several of the

patterns already discussed in regards to schema mapping and transformation. Figure

2.5 illustrates those three identified strategies.

The “Individual” strategy can be compared to the GAV approach in which each

source has a specification that dictates interaction with other sources in order to con-

struct a global schema. It is this particular example of the individual strategy that

highlights the scalability issues associated with a pure GAV approach. Here, we can

see that in the individual strategy a total of N(N − 1) individual interfaces need to be

constructed in order for the strategy to work correctly.

In the LAV approach the integration system is charged with the responsibility

of defining how each of the data sources interacts with each other; therefore there

is no need to manually specify those interactions as in GAV. This greatly improves

the potential of a LAV-based approach to scale easily. The disadvantage of the LAV

approach is that unlike GAV, query reformulation is a more complicated exercise. This

is because in an LAV setting the only information about the underlying data held

in the global schema is the set of views used to represent each source (Lenzerini, 2002).

The views can only provide some information about the data source. This situation

therefore often dictates the need to perform additional queries to obtain information

on how to actually use the sources to acquire an answer to the query posed. Referring

back to Figure 2.5, an LAV approach can be compared with the “Interchange Format”

strategy. This comparison is made evident if the central node can be interpreted as the

Figure 2.5: Interfacing Strategies (Pascoe and Penny, 1990): (a) Individual, (b) Ring, (c) Interchange Format.

global schema and the outlying nodes as the views of the data sources that the schema

would interact with.

Additional derivatives of the GAV and LAV approaches have been presented in the

literature. Global and Local As View (GLAV) is a generalised combination of both the

GAV and LAV approaches (Madhavan and Halevy, 2003); furthermore, Levy provides

a good explanation and discussion of this derivative: “it combines the expressive power

of GAV and LAV, and the query reformulation is the same as LAV ” (Levy, 2000,

p. 582). The Both-As-View (BAV) approach (McBrien and Poulovassilis, 2003) has

been described as “a pathway of primitive transformation steps applied in sequence”

(Boyd, Kittivoravitkul, Lazantis, McBrien and Rizopoulos, 2004, p. 83) in which the

transformation process is built up in a series of discrete steps. At each step, a schema

construct is altered in some manner, e.g. by renaming or deleting it. Alongside each

change is a new query that specifies the extent of that change relative to the rest of

the schema.

The amount of research targeting how to model the mapping between sources is

indicative of the importance of this step in the integration process. The integration

systems that these methods are used in can be further classified; Batini et al. (1986)

illustrated three types of data integration:

1. Homogeneous, where all the sources of data share the same schema.

2. Heterogeneous, where data must be integrated from sources that may use dif-

ferent schemas or platforms (e.g., a combination of relational and hierarchical

databases).

3. Federated, where integration is facilitated by the use of a common export schema

over all data sources (i.e. the mediated approach).

The prototype system built and tested as part of this study (which will be discussed

in more depth in Chapter 3) falls under the homogeneous umbrella because all the data

sources used in testing had the exact same schema, albeit populated with different

content. Further to Batini’s classification of data integration scenarios, the prototype

exhibits traits of a federated approach where a common export schema, facilitated via

an Atom feed, is used to expose data from the sources for the target to consume. In

other words the Atom feed acts as a mediator between the data sources and the target.

Mediation is effected via two autonomous processes: an Atom feed builder that is used

to construct an Atom feed from source data, and a feed consumer used at the target


end to extract the source data and map it to the target’s schema. It should also be

noted that mediation in this instance is only in one direction flowing from the source

to the target data. This reflects the nature of the publish/subscribe architecture that

the Atom standard was originally intended for and which will be discussed next.
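Before moving on to publish/subscribe, the builder side of this mediation can be sketched in outline. The following PHP fragment is purely illustrative (the table, column, connection and URN names are invented; the real modules are described in Chapter 3): it queries a source table for recently changed rows and emits each one as an Atom entry.

<?php
// Illustrative sketch of an Atom feed builder: expose recently changed rows of a
// hypothetical source table "track" as entries in an Atom feed. All names invented.
$db = new PDO('mysql:host=localhost;dbname=source', 'user', 'password');

$since = '2006-06-01 00:00:00'; // in practice, the time the feed was last built
$stmt = $db->prepare('SELECT id, title, artist, updated_at FROM track WHERE updated_at > ?');
$stmt->execute([$since]);

header('Content-Type: application/atom+xml; charset=utf-8');
echo "<?xml version=\"1.0\" encoding=\"utf-8\"?>\n";
echo "<feed xmlns=\"http://www.w3.org/2005/Atom\">\n";
echo "  <title>Source database updates</title>\n";
echo "  <id>urn:example:source-feed</id>\n";
echo '  <updated>' . date('c') . "</updated>\n";

// One entry per changed row; the feed consumer at the target end parses these
// entries and maps their content onto the target schema.
foreach ($stmt as $row) {
    echo "  <entry>\n";
    echo '    <title>' . htmlspecialchars($row['title']) . "</title>\n";
    echo '    <id>urn:example:track:' . $row['id'] . "</id>\n";
    echo '    <updated>' . date('c', strtotime($row['updated_at'])) . "</updated>\n";
    echo '    <summary>' . htmlspecialchars($row['artist']) . "</summary>\n";
    echo "  </entry>\n";
}
echo "</feed>\n";

The consumer module would perform the inverse: fetch this feed over HTTP, parse the entries (for example with PHP’s SimpleXML extension), and insert or update the corresponding rows in the target database.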

2.4 Publish/Subscribe

The publish/subscribe architecture, also known as implicit invocation in software en-

gineering circles (Campailla, Chaki, Clarke, Jha and Helmut, 2001), has received sig-

nificant attention recently for its claimed ability to provide a flexible and highly scal-

able framework for large distributed information systems (Eugster, Felber, Guerraoui

and Kermarrec, 2003; Vargas et al., 2005; Wang, Jin and Li, 2004; Farooq, Parsons

and Majumdar, 2004; Gupta, S., Agrawal and El Abbadi, 2004; Ge, Ji, Kurose and

Towsley, 2003). The architecture uses a common framework, not unlike the mediator

approach mentioned in Section 2.3, that subscribers use to register their interest in the

occurrence of specific events.

Publishers provide notification to the framework when a new event has occurred.

The framework itself manages tasks associated with matching the descriptions pro-

vided by the subscribers to the content being made available by the publishers. These

tasks can be undertaken by using broker agents like the systems SIENA and Gryphon

mentioned by Baldoni, Contenti and Virgillito (2003). As described by Eugster et al.

(2003), a publish/subscribe system can provide a high level of scalability because it

can decouple objects participating within the system on three different dimensions:

Time: Objects do not need to be active within the system at the same time in order to

interact with the other participating objects. For example a publisher object may

publish something while a subscriber object is offline. Conversely, a subscriber

can receive publication event notifications when the publisher is offline.

Space: Objects interacting through the system do not need knowledge of the other

objects’ existence. This is because the participating objects (publishers and sub-

scribers) do not interact with each other directly but rather through the event

service provided by the publish/subscribe architecture. Therefore, the event ser-

vice is a mediator that can manage the publication and dissemination of events

and data.


Synchronisation: This dimension regards how objects interact with each other. A

publishing object does not become unavailable whilst it emits an event. In ad-

dition, consumers (subscribers) can receive notification of an event at any time

after that event was posted. Thus events and data are propagated and processed

in a completely asynchronous manner.

The loose event propagation afforded by the publish/subscribe paradigm is ideal

for large distributed systems, as it removes the costly overhead involved in maintaining

synchrony of distributed objects attempting to interact with each other.

Publish/subscribe systems fall into three main categories (Eugster et al., 2003):

Topic Based: A topic based system organises information through a set of predefined

subjects (topics) where each subject represents a distinct information channel.

Therefore a subscriber would look for and subscribe to a particular channel or

channels that would best fit their information requirements.

Content Based: A content based approach, as the name suggests, relates the sub-

scription definitions directly to the content of the exchanged information itself.

Therefore channels are not formally structured or defined like a topic based ap-

proach, but instead have a dynamic logical representation.

Type Based: Eugster et al. (2003) suggested the type based approach as a potential

substitute for a topic based system. A type based system looks at the actual

structure of the event or information being passed, then groups items according

to the structures it identifies.
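The three categories above differ only in how subscriptions are matched against events; the decoupling itself can be shown with a very small, purely illustrative topic-based sketch (none of this code comes from the systems cited):

<?php
// Illustrative topic-based publish/subscribe sketch. Subscribers register a callback
// against a named topic; publishers post events to a topic without knowing who,
// if anyone, is currently subscribed.
class EventService
{
    private $subscriptions = [];

    public function subscribe(string $topic, callable $handler): void
    {
        $this->subscriptions[$topic][] = $handler;
    }

    public function publish(string $topic, array $event): void
    {
        // Delivery here is an immediate loop, but it could equally queue the
        // event for later delivery, giving the time decoupling described above.
        foreach ($this->subscriptions[$topic] ?? [] as $handler) {
            $handler($event);
        }
    }
}

$service = new EventService();
$service->subscribe('sales', function (array $event) {
    echo "New sale: {$event['item']} at {$event['price']}\n";
});
$service->publish('sales', ['item' => 'MP3 track', 'price' => '$1.99']);

In a broker-based system such as SIENA, Gryphon or Hermes the event service runs as separate infrastructure between the parties rather than as an in-process object, which is what provides the space and time decoupling across a network.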

Vargas et al. (2005) describe a novel application of the publish/subscribe paradigm

that is particularly relevant to this research, as it adapts the publish/subscribe para-

digm to the data integration domain.

Figure 2.6 illustrates an architecture developed by Vargas et al. (2005) to inte-

grate several PostgreSQL databases. The system makes use of an Active Predicate

Store which stores and maintains definitions of conditions that are of interest, what

possible actions to take when a condition is met, and finally what notifications to

publish through to the database’s Hermes adapter. Hermes is the publish/subscribe

infrastructure used to facilitate the data integration functionality.

For example, when a change in state of the database occurs (e.g. an update), a

trigger is fired which in turn is associated with one (or possibly more) predicates held in

Figure 2.6: A data integration framework using publish/subscribe (Vargas et al., 2005).

the Active Predicate Store. If the condition matches that defined within the predicate

then any actions associated with that predicate are executed. The final step is the

publication of notification messages which are sent to the Hermes system indicating

the details of the change that occurred in the database.

The approach uses Event-Condition-Action (ECA) rules to describe the state change

situations to be monitored. It also provides space for additional conditions to be

evaluated when an event is detected and possible actions to take if the condition is

met. These rule definitions are housed within the Active Predicate Store, which is

located at the database storage level; they are essentially a set of specialist tables used

to maintain the event monitoring definitions for a particular database.

Above this is an Active Predicate Management layer which manages triggers asso-

ciated with the database, as well as features to evaluate a rule’s conditions in the event

of a trigger being activated. This layer also manages the creation of publish notification

messages that are sent to the Hermes adapter. A message queue is placed between the

Active Predicate Management layer and the Hermes adapter to facilitate exactly-once

delivery of notification messages from the Active Predicate management layer. The

notification messages themselves are XML based, which potentially means that the

messages could also be sent to systems other than Hermes. The Hermes adapter is

responsible for transforming the notification messages coming from the database into

a format appropriate for the Hermes system to use. This then enables the notifications

to be disseminated to subscribers of that particular information.

To summarise, the Hermes-based system described by Vargas et al. (2005) provides

an asynchronous, event-based system that pushes notification messages from a database

to the Hermes system for subscribers to consume. Hermes itself is classed as a content

based publish/subscribe system, which means the subscribers can describe their specific

information requirements with respect to the content of interest.

This section has discussed the publish/subscribe paradigm and has used the Her-

mes-based data integration system presented by Vargas et al. (2005) to provide a perti-

nent example of the publish/subscribe paradigm put to use within the domain of data

integration. To recap, a publish/subscribe architecture makes use of a common frame-

work that is used by subscribers to register their interest in the occurrence of events

with. Publishers provide notification to the framework when a new event has occurred.

The framework itself manages tasks associated with matching the descriptions provided

by the subscribers to the content being made available by the publishers.

A comparison of the Hermes-based system and the Atom prototype will be presented


later in Chapter 3. Both share the common feature that they facilitate data integration

by means of asynchronous update propagation. Update propagation in itself provides

its own unique challenges and forms the basis of discussion of the following section.

2.5 Update Propagation

Update propagation refers to the problem of updating copies of an object (Date, 2004)

and is commonly associated with distributed systems. The problem is centred on the

need to ensure that if a change is made to an object (e.g., a row is updated in a table)

then that change must somehow be disseminated to all other copies of that object in

the system (Date, 2004; Silberschatz, Korth and Sudarshan, 2006). This behaviour

ensures that data remain consistent at all sites. Within a distributed context, Ozsu

and Valduriez (1999) refer to this situation as “mutual consistency”, which they define

as “the condition that requires all the values of multiple copies of every data item to

converge to the same value” (Ozsu and Valduriez, 1999, p. 21).

The methods used to implement update propagation can be classified into two types:

synchronous or asynchronous, sometimes also referred to as eager or lazy approaches

respectively (Breitbart, Komondoor, Rastogi, Seshadri and Silberschatz, 1999).

The main advantage of a synchronous approach is that it can implement real-

time data consistency throughout the entire system. This is due to the fact that

a synchronous approach will use a two phase commit transaction protocol to apply

changes to all targets within the system as one transaction (Buretta, 1997). In this

situation latency (the time between when an update is made and when its effects

have been dispersed throughout the system, e.g., updating replicas in a distributed

database) is effectively zero because all the distributed copies are updated in one atomic

transaction. A disadvantage to this approach is that it is not particularly scalable

in terms of supported transaction volume, due to the increase in the probability of

deadlocks occurring as the number of transactions taking place increases (Gray, Homan,

Korth and Obermarck, 1981). This means that as the number of transactions performed

within the system increases, so does the probability of the system becoming unavailable

to users due to resource contention (i.e., deadlocks). However, techniques like two phase

locking used to prevent, avoid and recover from deadlocks as described by Silberschatz

et al. (2006); O’Neil and O’Neil (2001); Connolly and Begg (2005); and Atzeni, Ceri,

Paraboschi and Torlone (1999) can be adapted for use in a distributed setting.

In contrast an asynchronous approach can provide loose consistency (Buretta,


1997). Latency is always greater than zero, so there is a higher degree of lag be-

tween when the original update was executed and when the effects of that update have

been propagated throughout the various parts of the system. Unlike a synchronous

approach, an asynchronous approach does not adhere to a two phase locking protocol

and can be implemented in several different ways. One method performs regular re-

freshing of all the distributed sites; this can be done by either using a complete or an

incremental refresh.

A complete refresh is achieved when updates to the primary data sources are queued

up and executed as a batch resulting in a blanket update of everything within the

system. An incremental refresh works in much the same way as a complete refresh

except that only changes that have been made since the last refresh occurred are

processed. A disadvantage with complete or incremental refreshing is that it strips

the transactional features or granularity from the updates when they become queued

in the data staging area. Unlike the synchronous approach, where updates would be

propagated at or very near to the time they eventuated, an asynchronous approach may

distribute updates as a batch. This means it is more difficult to provide serialization

to transactions which in turn makes it more difficult to roll back the database to a

previous consistent state in the event of some error or failure.
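To make the distinction concrete, the sketch below contrasts the two refresh styles in PHP. It is purely illustrative: the connection details, table names and the last_modified column are hypothetical, and both tables are assumed to be reachable from the one connection for simplicity.

<?php
// Illustrative sketch only: complete versus incremental refresh (hypothetical schema).
$pdo = new PDO('mysql:host=localhost;dbname=replica', 'user', 'password');

// Complete refresh: discard the replica's copy and reload everything from the source.
$pdo->exec('TRUNCATE TABLE replica_album');
$pdo->exec('INSERT INTO replica_album SELECT * FROM source_album');

// Incremental refresh: apply only the rows changed since the last refresh,
// using a timestamp recorded by the previous refresh run.
$lastRefresh = trim(file_get_contents('last_refresh.txt'));
$stmt = $pdo->prepare('SELECT * FROM source_album WHERE last_modified > :last_refresh');
$stmt->execute(array('last_refresh' => $lastRefresh));
foreach ($stmt as $row) {
    // Apply each changed row to the replica (details omitted in this sketch).
}
file_put_contents('last_refresh.txt', date('Y-m-d H:i:s'));
?>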

However, if a system does not need constant real-time data consistency, an asyn-

chronous approach will allow for a flexible implementation. Another disadvantage,

caused by the increased latency between updating the original and updating the copies,

is the issue of how to deal with conflicting updates that may have been executed later

or from another site.

2.6 Data Streaming

So far the discussions and examples presented have dealt primarily with data that

resides in more “traditional” data management systems. Recently, however, new com-

mercial/societal environments have led to systems having to deal with data not stored

in a static conventional space but that are instead coming from dynamic sources. Such

sources can be referred to as streams of data. Data streaming refers to models or

systems where the data in use or in demand are not conducive to being housed in a

conventional relational structure; rather the data arrive as a continuous, transient flow

or “stream” of data (Golab and Ozsu, 2003b; Babcock, Babu, Mayur, Motwani and

Widom, 2002).

Figure 2.7: Golab and Ozsu's (2003a) DSMS architecture. [Diagram components: streaming inputs, input buffer, query processor, output buffer, streaming outputs, working storage, summary storage, static storage, query repository, user queries.]

Figure 2.7 depicts an example of a generalisation of a data streaming management

system architecture (Golab and Ozsu, 2003a). The approach to querying such a data

source is also different to that of a persistent relational source. A query posed over

a conventional relational schema will produce an accurate result set based upon a

snapshot in time of the data being queried; an RDBMS will produce a query result

based upon the state of the data it is holding at the time the query was executed. This

type of querying is difficult to do when the data source in question is in fact a stream of

transient data whose state is subject to constant change. The practical difficulty

arises due to the inordinate demands on a system’s finite memory resources (Golab

and Ozsu, 2003b; Babcock et al., 2002; Golab and Ozsu, 2003a; Xu, 2001; Madden and

Franklin, 2002). Therefore the type of querying often performed in a data streaming

scenario can be classified as continuous (Arasu, Babu and Widom, 2004) and often the

results are more approximate in nature.

Examples of current applications that use or generate data streams can be found

in areas such as sensor monitoring, finance, medicine and asset management, among

others. The following three systems are examples of applications from these areas.


• Ethereal (Ethereal, 2005) is a network monitoring and analysis tool that can

capture live data from a computer network, such as traffic levels, packet data

etc., summarise it and display that summary to the user or allow the user to

perform more sophisticated analysis.

• LifeShirt (Vivometrics, 2005) is a new product that enables non-invasive moni-

toring of a human patient’s vital signs e.g. pulse, CO2 levels or blood pressure.

The domestic version sends data to a Palm PC-like device for doctors to view.

A military version of the system is being investigated (Vivometrics, 2006), which

would also stream patient data back to a central point to help in planning medical

evacuations or personnel deployment.

• Apama (Progress Software, 2006). Figure 2.8 illustrates the architecture of the

Apama financial analysis tool. It accepts incoming data streams, which can be used in combination with other data sets to generate financial analysis information for the user.

Data streaming has become a significant area of research in its own right. Due to

changes in commercial and societal conditions, information systems of various forms

more frequently have to deal with data coming from sources that are highly dynamic

and transient. In response to this challenge, Data Stream Management Systems like

that illustrated in Figure 2.7 are being developed to allow users to derive meaningful

information from these data streams. An Atom feed can be considered a type of data stream: the feed is potentially a transient collection of data, as it can have new entries added to it at any given time.

2.7 Summary

This review has presented a series of topics from the recent body of work in the fields

of data integration and data management to illustrate and discuss where both this

research and the Atom specification are positioned. The Semantic Web was addressed

initially in order to provide background and context to the Atom content syndication

format. Atom has been developed in response to perceived issues regarding RSS,

the content syndication standard written in RDF, which is the language for building

applications for the Semantic Web.

Data integration looks at the problem of trying to portray a unified view to a user

of data sets that might not only be located in different places but also structured in

Figure 2.8: Apama financial analysis system (Progress Software, 2006). [Diagram components: open APIs, system monitoring, an integration adaptor framework, GUIs, remote calculators, external libraries, run-time dashboards, graphical scenario development and the Apama engine; adapted from the original graphic at http://www.progress.com.]

a variety of different formats. We evaluated Atom for its potential as a lightweight

platform to support data integration by means of asynchronous update propagation

from a series of data sources to a single target database.

The publish/subscribe paradigm was discussed next to show an environment in

which Atom could be used. Additionally, a pertinent piece of research that used the

Hermes publish/subscribe infrastructure as a novel platform for data integration was

discussed. A feature that both the system presented in this thesis and the Hermes based

system presented by Vargas et al. (2005) share is that they facilitate data integration

by means of asynchronous update propagation.

Update propagation refers to the concept of ensuring that updates made to a data object are distributed to all other copies of that object in the system.

Data streaming and the Atom feed can be considered to share similar characteristics.

Data streaming is a significant area of research in its own right: information systems increasingly have to process data that are highly dynamic and transient. A detailed treatment, however, is beyond the scope of this work.

In the next chapter, we will discuss the prototype implementation of the Atom-

based data integration architecture built for the purposes of this research.


Chapter 3

System Design

3.1 Introduction

This chapter discusses how an implementation of the Atom-based data integration

architecture was created for the purposes of this research. First Section 3.2 summarises

a series of use cases that were used to give guidance in development of the Atom

prototype. In order to evaluate the concept of using Atom for data integration, it

was decided to undertake a series of implementations based on use cases derived from

previously completed projects. Three use cases were identified as candidates, with each

one having a degree of complexity and scale slightly greater than that of its predecessor.

The first was the implementation of a prototype as an infrastructure for a movie

timetable e-catalogue system. This was considered the most elementary of the three

cases. The second case was to provide query propagation functionality to a digital

music (e.g. MP3) retail system; this system, in essence, extends the functionality of

the movie timetable system, by providing the capability to not only add and delete

records, but also to update them. The third and final use case was a data warehouse

solution for an electronics supplier. However, an implementation based on this case

was not completed due to time and resource constraints.

Due to the evolutionary nature of the development of the prototypes for each case,

each implementation consisted of a very similar architecture whereby data was exported

as an Atom feed from the source database(s), after which the feeds were consumed by

an integration module, then applied to the target database.

In Section 3.4 a brief description of the development environment and implementa-

tion rationale is presented, followed by a more in-depth look at the various components

of the implemented architecture. Section 3.5 discusses a comparison between our im-

plemented architecture and that of Vargas et al. (2005).


3.2 Use Cases

3.2.1 Movie Timetable e-Catalogue

Case Scenario

The first, least technically demanding of the scenarios, was to create a system to

integrate movie timetable data from multiple cinemas. In this scenario the Atom

architecture uses movie timetable databases located at cinemas as data sources. The

target is a database holding content for a dynamic data-driven website.

The purpose of the website is to allow users to browse currently screening and

soon to be screened movies to find out what cinemas are showing which movies, the

screening times of the movies and to compare ticket prices between cinemas showing

the same movie. Such an application relieves users from the task of remembering all

the possible cinemas they could attend and allows them to focus on actually finding

out the information they need about the movie that is of interest to them.

System updates are infrequent, as they reflect the period during which a movie is shown at a cinema, commonly several weeks.

Design Goals

The main goal in this scenario is to provide an infrastructure to integrate movie

timetable data from different cinema databases. The system is geared toward a light-

weight environment with source data updates occurring at a low frequency (such as

weekly). All previous data are overwritten by the latest.

Assumptions

The system is not intended to provide an archival service; it will provide users with

current timetable information for movies currently playing or movies that are to begin

screening very shortly. No update functionality is available in this system, so changes

require the entire Atom feed for that particular cinema data source to be refreshed.

3.2.2 MP3 Music Retail System

Case Scenario

With the initial movie catalogue implemented, the following use case scenario was

implemented to extend the functionality of the prototype. The system documentation


for the music kiosk project has been provided in Appendix A. The scenario for this

case is that a kiosk has been designed to sell music in digital format to consumers. The

kiosks are placed in high foot-traffic areas like shopping centres, or leased to businesses

such as music shops to complement their existing trading.

The system provides an interface to the user allowing them to search the databases

of music suppliers (record labels, for example Sony or EMI) for the specific albums or

tracks that they are looking for. Once the user has selected the tracks they want, they

use their credit card, EFTPOS card or mobile phone to pay for the tracks, which are

then downloaded to their portable storage device such as a portable hard disk or MP3

player, or written to a blank CD.

Design Goals

What was needed was an architecture that could act as a mediator between the mu-

sic supplier’s databases and the database within each of the music kiosks. Another

requirement was that the implementation should not interfere in any way with the

supplier’s data sources. The architecture needed to be able to provide the ability to

insert, delete or update records stored in the kiosk database, reflecting changes to the

stock of digital music stored at each of the participating music supplier’s databases.

In addition, due to the fact that changes to the source data are sporadic, the system

had to be highly responsive to ensure the data housed in the kiosk are kept up to date.

Assumptions

No music files are stored on the kiosk itself; rather each kiosk has its own database

which stores the locations of the music files, and thus the only type of data stored

would be essentially alphanumeric text.

3.2.3 Electronics Retailer Data Warehouse

Case Scenario

The final case to be developed was that of an architecture to support a data warehouse

for an electronics retailer. In this scenario, data from the retailer’s outlets throughout

New Zealand are transformed and inserted into an Atom feed ready to be consumed

by the data warehouse.


Design Goals

The goal of this implementation was to extend further developments from the previous

use cases to provide a means to send query requests between the retail outlet databases

and the company headquarters where the data warehouse is located. This would pro-

duce a prototype that could provide full data manipulation capability to heterogeneous

data sources. Due to time constraints, this implementation was not able to be com-

pleted; however, it does show the direction that the development of the prototype was

taking.

3.3 Requirements Summary

There are several requirements shared by all the use cases presented. The implemented

system needed to be lightweight in terms of network and computational resource con-

sumption. Furthermore, it had to provide a non-invasive means of exposing source

data, i.e., it is strictly a mediator between the source and target objects. Finally, the

system should also be platform independent to reflect the diversity of environments it

could reside in.

3.4 Development

3.4.1 The Development Environment

The criteria for the selection of technologies to develop the prototype in were derived

from three considerations:

1. The technical abilities of the researcher.

2. The goals of the use case scenarios.

3. The resources that were accessible from the University of Otago Information

Science Department.

The development environment consisted of a single Dell Optiplex GX280 computer

with a single 2.8GHz CPU and 1 GB of RAM that had been issued to the author at the

commencement of this research. The computer had installations of the IIS 5 web server,

PHP 5, MySQL 4, the Firefox web browser and Windows XP Professional operating

system. In addition there was access to PostgreSQL and SQL Server instances via the


University of Otago campus network. Development of PHP scripts was undertaken

with a simple text editor.

The IIS 5 web server was used initially because it was already installed and con-

figured on the computer. PHP is a scripting language for Web-based application development; PHP 5 was chosen because of the author's familiarity with the language

and for its support of multiple operating systems, databases and other technologies.

MySQL was chosen for similar reasons to PHP, while the Firefox browser was selected

because it was available in both PC and Apple versions, which was useful as the testing

environment was located on a network of Apple computers (see Section 4.3).

3.4.2 Implementation Rationale

As mentioned in Section 3.2, the goal of the research was to identify the potential of an

Atom-based lightweight architecture for facilitating a data integration solution suitable

for general small-scale scenarios, for example for SMEs.

The design for the prototype data integration architecture uses an Atom feed as a

mediator between a data source and the target. Furthermore, the architecture design

comprised two layers: the data export layer and a feed processing (mediation) layer,

as shown in Figure 3.1. An Atom feed generator is located within the data export

layer and is responsible for exposing new data from the source and sending it to the

Atom feed. Within the feed processing layer an Atom feed consumer is responsible for

reading the Atom feed and applying the updates to the data target. The prototype

makes use of a predefined hard-coded mapping written specifically for the schemas used

in the experiments and implemented in PHP.

Altering the means by which the feed generator, the Atom feed and the feed con-

sumer interact with each other enables different configurations of the data integration

architecture to be implemented. Figure 3.1 presents three suggested configurations

named “Pull”, “Push” and “Push + Pull” respectively. Pull represents the simplest configuration in that the Atom feed generator/builder and the feed consumer operate completely independently of one another; the flow of information to the target is therefore governed by the feed consumer. The Push method controls the consumption of Atom feed data on the basis of changes in the state of the source data.

The Push + Pull method is more complicated than the previous two configurations: it enables the feed generators and consumers to message one another when new data are available, and it also potentially allows flow of data back to the data source. Flow back to

the data source could also be theoretically implemented in the previous configurations

by providing a feed consumer to the original data source and a feed generator to the

original data target.

Further discussion on the Push and Pull configurations can be found in Section 3.4.5; in addition, it should be noted that the mappings from the data source to the Atom feed and from the Atom feed to the data target have been explicitly specified, i.e., hard-coded.
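By way of illustration only, such a hard-coded mapping could be expressed as a simple PHP associative array pairing each Atom entry element with a column in the target table; the element and column names below are hypothetical rather than those of the actual prototype.

<?php
// Hypothetical hard-coded mapping from Atom entry elements to target table columns.
$targetTable  = 'album';
$albumMapping = array(
    'title'        => 'album_title',    // <title> element -> album_title column
    'artist'       => 'artist_name',
    'genre'        => 'genre',
    'release-date' => 'release_year',
    'cover-path'   => 'cover_image',
);
?>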

The decision was made to build a prototype system that implements the theoretical

Atom-based architecture. This option was adopted because this would provide direct

feedback as to the effectiveness of such an approach to data integration.

The prototype was implemented using PHP 5, as this technology requires little

overhead to set up, is platform independent and from a pragmatic standing, provides

an opportunity to explore implementation related issues and testing.

The prototype can work with MySQL, PostgreSQL and SQL Server database servers

(testing was carried out using MySQL database servers) and exhibits some degree of

platform independence by being capable of running on both a Windows XP operating

system (the development environment) and a Mac OS X operating system (the testing

environment). As Mac OS X is UNIX based, this indicates the system should also run

on other UNIX/Linux derivatives.

The following sections illustrate the key components of the Atom-based architecture

and the options available regarding how the components could be configured to create

different forms of the architecture presented in Figure 3.1. Essentially, the architecture

works by routinely checking the state of the data against a previous copy of it. When a

change is detected, a new Atom entry containing the information regarding the change

is created and appended to the Atom feed. The change is applied to the target when

the target's Atom feed consumer parses the feed and processes the new entry. Section

3.4.5 describes two particular configurations of the system that we investigated.

3.4.3 The Feed Builder Module

Figure 3.2 presents a flow chart of the Atom feed builder module implemented in this

prototype architecture. It is a completely self-contained unit, consisting of two key

components:

1. The staging database, used to capture update data for the Atom feed.

2. A library of functions and object classes that implement the functionality of the feed builder itself.

Figure 3.1: Overview of the basic architecture. [Diagram: data sources feed staging databases and Atom feed generators in the data export layer; the generated Atom feeds are read by Atom feed consumers in the feed processing layer and applied to the target database, in the “Pull”, “Push” and “Push+Pull” configurations.]

Figure 3.2: Flow chart of Atom feed builder. [Loop: wait X seconds; copy source data to the staging database; if the data source content differs from the previous copy, update the Atom feed and reset the staging database tables; repeat until the builder is shut down.]

The reason for using a staging database as part of the prototype architecture is a

direct response to requirements set down in the use case scenario. As mentioned in

Section 3.3, the data integration architecture must not in any way interfere with or

require intrusive alterations to the systems used to store and manage the source data.

The staging database constructed for this implementation consists of two identically

structured tables. The structure of these tables is, in essence, a specialised denormalised

form of the source data schema. Each column within the staging database tables

represents a corresponding column from the data source, which was chosen based on

the data requirements of the target. The term “column” has been used, as the target

in the prototype environment was a relational database, however, there is no reason

from a theoretical standing why the target, or data source for that matter, cannot be

some other form of data model or structure.

The feed generator queries or “polls” its corresponding data source at regular in-

tervals and compares that query result (snapshot) to a previous snapshot, also stored

in the staging database. The comparison query set consists of three separate queries;

one that checks for newly inserted data, one for deleted data and one that looks for

data that have been updated. Data that have remained unchanged since the last time

the data source was polled are ignored.
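A minimal sketch of such a comparison query set is given below. It assumes two identically structured staging tables, snapshot_current and snapshot_previous, keyed on an album_id column; these names are hypothetical and the code is illustrative rather than the prototype's own.

<?php
// Illustrative comparison queries against the staging snapshots (hypothetical names).
$pdo = new PDO('mysql:host=localhost;dbname=staging', 'user', 'password');

// 1. Newly inserted rows: present in the current snapshot but not the previous one.
$inserted = $pdo->query(
    'SELECT c.* FROM snapshot_current c
       LEFT JOIN snapshot_previous p ON c.album_id = p.album_id
      WHERE p.album_id IS NULL')->fetchAll();

// 2. Deleted rows: present in the previous snapshot but not the current one.
$deleted = $pdo->query(
    'SELECT p.* FROM snapshot_previous p
       LEFT JOIN snapshot_current c ON p.album_id = c.album_id
      WHERE c.album_id IS NULL')->fetchAll();

// 3. Updated rows: present in both snapshots but with differing non-key values.
$updated = $pdo->query(
    'SELECT c.* FROM snapshot_current c
       JOIN snapshot_previous p ON c.album_id = p.album_id
      WHERE c.album_title <> p.album_title OR c.artist_name <> p.artist_name')->fetchAll();
?>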

If the latest query results differ from the previous snapshot, then updates have oc-

curred in the data source, and new entries corresponding to these source data changes

are created. Figure 3.3 illustrates an example entry for an album record that is to be

inserted into the target database. To create the entry items, the feed builder first col-

lects all the data required for the update from the columns within the staging database.

Then using pre-defined functions from the library, the feed builder transforms these

data into an Atom 0.3 draft standard entry item. Each column in the staging database

has a corresponding element within the Atom entry. For example, a column name

“track title” in the staging database would have an entry item element resembling:

<track_title>Take Five</track_title>
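One simple way to picture this column-to-element transformation is sketched below; it is illustrative only (the prototype's own library functions are not reproduced here) and the row data are hypothetical.

<?php
// Turn one staging-database row into Atom entry elements named after its columns.
$row = array('track_title' => 'Take Five', 'artist' => 'Dave Brubeck', 'genre' => 'jazz');
$elements = '';
foreach ($row as $column => $value) {
    $elements .= '<' . $column . '>' . htmlspecialchars($value) . '</' . $column . '>' . "\n";
}
echo $elements;   // e.g. <track_title>Take Five</track_title> ...
?>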

The new entries are then appended to the existing Atom feed in order of occurrence,

thus creating a succinct and sequential series of updates that can be parsed and pro-

cessed by the data target’s Atom feed consumer. A benefit of appending entries in this

way is that it allows for safe concurrent updates to be applied to the target even when

the target's feed consumer has been offline for a length of time (e.g. due to a network connection failure). However, one trade-off apparent in the current implementation is a situation where the feed size is ever increasing, which could lead to performance problems after some time. This problem could be reduced by implementing some form of archival system in which, instead, a chain of files that logically represents the Atom feed as a whole would be created over time.

<entry>
<content>
<title>Soultrane</title>
<link rel="alternate" type="text/html" href="http://www.url.com"/>
<cmd>ADD</cmd>
<artist>John Coltrane</artist>
<genre>jazz</genre>
<cover-path>cover.jpg</cover-path>
<release-date>1958</release-date>
<track title="title1" length="0:12:08" location="pth1" size="16.6" track_no="1"/>
<track title="title2" length="0:10:55" location="pth2" size="15.0" track_no="2"/>
<track title="title3" length="0:06:17" location="pth5" size="5.21" track_no="3"/>
<track title="title4" length="0:04:56" location="pth4" size="5.26" track_no="4"/>
<track title="title5" length="0:05:34" location="pth3" size="15.0" track_no="5"/>
<modified>7-1-2006:21:56:21</modified>
</content>
</entry>

Figure 3.3: Example Atom entry from the MP3 kiosk use case.

Once the Atom feed has been “refreshed” by the feed builder appending the new

feed entries, the query result from which those new updates were garnered replaces the

previous snapshot. This is done by truncating (emptying) the table and inserting the

new data. At this point, the process is finished and a countdown until the next time

the data source should be polled is resumed.
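Reusing the hypothetical staging table names from the earlier sketch, this snapshot replacement step amounts to little more than:

<?php
// Replace the previous snapshot with the data just compared (hypothetical names).
$pdo = new PDO('mysql:host=localhost;dbname=staging', 'user', 'password');
$pdo->exec('TRUNCATE TABLE snapshot_previous');
$pdo->exec('INSERT INTO snapshot_previous SELECT * FROM snapshot_current');
?>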

To summarise the features of the feed builder module, it is a completely self-

contained and autonomous process. It polls the data source at a regular, predetermined

time interval to compare the state of the source data with how it was the previous time

the module checked it. Updates occur when changes are discovered in the source data.

New Atom format entries are appended to the end of the Atom feed file.


3.4.4 The Feed Consumer Module

From an architectural viewpoint, the Atom feed consumer module is very similar to

that of the feed builder. The consumer comprises two components:

1. A timestamp item, which stores a timestamp of the last time the consumer

applied an update to the target.

2. A library of functions and object classes implementing the feed consumer func-

tionality.

The flow of data is the converse to that of the feed builder, i.e., from the Atom feed

to the target, rather than from the source to the Atom feed, as illustrated in Figure

3.1.

Figure 3.4 presents a flow chart of the prototypical Atom feed consumer. The feed

consumer works by polling the Atom feed, parsing the feed, and then comparing the

feed's <modified> or <updated> element content [1] with the timestamp it has stored in its repository. If the two timestamps are different, and the timestamp in the Atom feed is “newer” than the consumer's, then the Atom feed must have been updated recently; the feed consumer module therefore initiates the update process.

The update process involves the feed consumer iterating through the list of en-

tries currently contained in the Atom feed. When the consumer finds an entry whose

timestamp (located in the entry’s own <modified> element) is newer than that of

the consumer, it processes that entry. In the music kiosk system use case, the feed

consumer processes an entry by using a pre-defined mapping stored in its library to

construct a transaction that is in the correct language and syntax (in this instance

SQL) and maps the entry's elements to a corresponding field in the target database [2].

Once the feed consumer has reached the end of the Atom feed, the update pro-

cess is completed by updating the consumer’s timestamp repository and it restarts its

countdown for polling the Atom feed once more.
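The consumer's polling cycle can be sketched roughly as follows. The sketch is illustrative only: it uses SimpleXML rather than the prototype's own library, assumes the feed is read without XML namespaces (as in the example of Figure 3.3), assumes the <modified> values can be parsed by strtotime(), and the feed URL, file name and polling interval are hypothetical.

<?php
// Illustrative consumer polling cycle (not the prototype's actual library code).
$feedUrl       = 'http://source.example.com/atom/feed.xml';   // hypothetical feed location
$timestampFile = 'consumer_timestamp.txt';                    // last applied feed timestamp

while (true) {
    $feed        = simplexml_load_file($feedUrl);
    $feedTime    = strtotime((string) $feed->modified);
    $lastApplied = (int) trim(file_get_contents($timestampFile));

    if ($feedTime > $lastApplied) {
        foreach ($feed->entry as $entry) {
            $entryTime = strtotime((string) $entry->content->modified);
            if ($entryTime > $lastApplied) {
                // Map the entry's elements to target columns and apply the update
                // (SQL construction omitted; see the mapping sketch in Section 3.4.2).
            }
        }
        file_put_contents($timestampFile, $feedTime);   // remember the new feed timestamp
    }
    sleep(15);   // wait for the configured polling interval before checking again
}
?>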

[1] In this case the <modified> element content is compared, as the current prototype only has support for the draft 0.3 Atom specification.

[2] One obvious improvement would be to enable direct alteration or creation of a mapping by the user through some kind of GUI, similar in concept to Altova's “MapForce” product (Altova, 2005). The rules could be housed within a native data structure of the consumer module and enforced through a generic mapping core.

Figure 3.4: Flow chart of Atom feed consumer. [Loop: wait X seconds; parse the Atom feed; if the consumer and Atom feed timestamps differ, apply the updates to the target and overwrite the consumer timestamp with the Atom timestamp; repeat until the consumer is shut down.]

3.4.5 System Configuration

The advantages of PHP with its platform independence features and the modularity of

the prototype components lend themselves to a variety of system configuration options,

both in terms of the location of various components and the means by which they

interact. For the purposes of this research, two methods of configuring the architecture

were investigated; classified as “Push” and “Pull” respectively.

Push

With the push method, the consumption of feed information is governed predominantly

by changes in state of the source data, i.e., when the feed generator detects a change in

state of the source data (for example when a record is updated) the feed is regenerated

(see Figure 3.1) and the consumer module is called immediately to apply the new

information to the target schema. The majority of this activity takes place at or

near the source data location, however, in practice, the location of each component is

not important as they are web-based. The current prototype partially supports the

push method; its functionality is limited due to some outstanding issues that restrict its performance, in particular the efficiency of managing the influx of updates to the

target needs improvement.

An initial criticism of this approach is that it can cause resource contention or a

“stampede-like” effect when multiple updates from multiple sources are pushed simul-

taneously to the target. This could be alleviated by adding an input staging area at

the target, or adding more intelligence to the consumer modules in the form of a cache

that would store updates and pass them to the target at the next window of oppor-

tunity. These additional layers of complexity however may make the push method a

less appealing option than other configurations, as the implementation of such features

could detract from the original goal of a “lightweight”, agile system.

Pull

The pull method differs from the push approach on two key points. First, the feed

consumer modules operate independently of, and are therefore not directly influenced

by, the feed generator component. Secondly, the flow of feed information to the target

schema is governed by the consumer module itself (see Figure 3.1); that is, the consumer

module will regularly check or “poll” the Atom feed to see if it has changed. This is

done by simply checking the Atom feed’s <modified> element. Hence, rather than


forcing or pushing feed data, it is instead “pulled” down to the target.

The pull approach uses only one feed consumer module for all of the Atom feed files,

unlike the push method, where each feed has a dedicated consumer. This approach

has a distinct advantage; it enables a constant stream of sequential transactions to be

passed to the target database without the need for additional layers of complexity to

manage such a feature. This means an avenue to implement a simple consumer module

is established and problems of resource contention and/or congestion between different

feeds and the target are alleviated as each feed is processed in an orderly manner.
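As an illustration of this orderly, feed-by-feed processing, a single consumer can simply iterate over its list of feeds on each polling cycle; the URLs and the processFeed() stub below are hypothetical, standing in for the consumer logic sketched in Section 3.4.4.

<?php
// One consumer module polling several Atom feeds in sequence (hypothetical URLs).
$feedUrls = array(
    'http://source-a.example.com/atom/feed.xml',
    'http://source-b.example.com/atom/feed.xml',
    'http://source-c.example.com/atom/feed.xml',
);

function processFeed($url) {
    // Parse the feed and apply its new entries to the target, as in Section 3.4.4.
    $feed = simplexml_load_file($url);
    // ... entry processing omitted in this sketch ...
}

foreach ($feedUrls as $url) {
    // Each feed is fully processed before the next is read, so transactions reach
    // the target as one sequential stream.
    processFeed($url);
}
?>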

There are however some potential disadvantages associated with this approach.

Firstly, as implemented in the prototype, the updates for one feed need to be processed

and completed before the next feed can be parsed and checked for updates. This issue

may create performance problems in situations where many updates occur at a high

frequency. Therefore a more efficient approach in terms of implementation would be to

break down the consumer module into two distinct processes run in parallel: 1) parsing

the feed and extracting the update data, and 2) applying the updates to the target.

3.5 Discussion

Although the system of Vargas et al. (2005, discussed in Section 2.4) is at a more ad-

vanced stage of development than the Atom-based prototype, a comparison of the two

approaches still yields some interesting points. First at a broad level both approaches

deliver update notification data asynchronously and they both use an XML-based spec-

ification to format those messages. However, the underlying methods by which the two

approaches do this are quite different. For instance, the Hermes-based system makes

extensive use of triggers and has developed additional features that enrich the source

relations with more reactive behaviour. In contrast to this, the Atom prototype makes

use of a staging area to routinely compare the current state of the database with a

previous “snapshot” to infer any changes. Therefore the Hermes approach represents

a more reactive and ad-hoc behaviour in comparison to the Atom prototype which is

regulated and more akin to a batch process.

Obviously, the Atom prototype’s data staging approach is rather simplistic and is

not overly efficient, as it relies on using additional copies of the data of interest to infer

if any changes have taken place. However, one particular advantage of this approach is

that it is potentially easier to port to other database platforms as it only requires two

additional tables to be created and some minor changes to a handful of SQL queries in


order to be set in place. The Hermes-based approach however, because it is specifically

designed with PostgreSQL in mind, would require a significantly larger amount of work

to port its equivalent reactive relation behaviour to other database platforms because

of the additional extensions it makes to the database’s relations.

Another key difference is the means by which updates are actually propagated to

targets/subscribers. The Hermes-based system actively posts notification messages

as soon as a specific event trigger has been detected. When a subscriber (target)

becomes active it receives the publish notifications from Hermes. However, in the

Atom “pull” approach the target is not directly notified of changes. Instead, like the

data staging area, the target routinely polls the Atom feed to see if any changes have

in fact occurred. Compared to the active push of notification information by Hermes,

the Atom consumers only find out if new information is available when they poll the

Atom feed.

With regard to the Atom feed, it has features found in both the Hermes publisher

message queue and the Hermes infrastructure. This assertion is based on two ob-

servations. First, the Atom prototype appends newly created update messages to the

existing list or “queue” in the Atom feed. Second, the feed itself represents the interme-

diary between data sources and targets, which is analogous to the Hermes infrastructure

role, albeit in a far more passive manner.

The comparison of the two systems can be condensed down to the following findings:

the Hermes system utilises the publish/subscribe paradigm to create an event-driven

asynchronous data integration framework, whereas the Atom prototype system adopts

a technique of asynchronously sampling the source data at a fixed frequency to infer

if new changes have occurred. The overriding theme of this comparison is that both

prototypes represent two approaches to data integration by means of asynchronous

update propagation.

The author’s Atom prototype architecture, when in the pull configuration, does

lend itself to providing a platform to facilitate update propagation. In fact, the re-

sulting architecture uses an asynchronous propagation approach to provide a means of

integrating data from a source to multiple targets, providing those targets all have an

Atom feed consumer module. In the architecture there are behaviours that can be used

to support this assertion. First, in its current configuration, the architecture makes

use of batch processing to carry out tasks. This can be seen in two areas:

1. The feed builder polls the data source at regular pre-defined intervals. If a change

in the database has been detected then the Atom feed is updated.


2. The feed consumer at the target end mimics this process by polling the generated

Atom feed to check for new updates.

Secondly, the architecture makes use of a data staging area to compare the current

instance of the source data with a recent copy to discover any new changes. At that

point these changes are prepared for, and appended to, the Atom feed.

The third and most compelling trait is that the updates coming from the Atom

feed for the targets to consume are not processed in real time relative to the time of

the originating event in the source. This observed behaviour means that the latency

between a target and its source is always going to be greater than zero, which is

consistent with the definition of an asynchronous method.

Sections 2.4 and 2.5 explained that an asynchronous approach is easily scalable

because of its general, simplified support infrastructure, allowing connections between

objects to be decoupled in terms of synchrony, space and time. However, publish/sub-

scribe systems like Hermes still make use of an intermediary infrastructure to manage

the delivery of notifications and data. Syndication technologies like Atom do not rely

on notification mechanisms; rather, the consumer is responsible for checking for new

updates.

Syndication thus represents a further simplified asynchronous framework that re-

moves additional infrastructure between objects, yet still retains the advantages of

scalability associated with asynchronous connection schemes. We have combined this

further benefit of simplified asynchrony with the low cost, platform independent tech-

nology PHP.

The collective advantages of the scalability potential of an asynchronous approach,

the simplified implementation afforded by Atom and the feature rich technology of PHP

display an avenue to create a data integration solution that is lightweight in terms of

impact on an organisation’s available resources.

The Atom-based architecture design resulting from this research activity resembles,

but is not totally identical to, a data streaming model such as the one described by

Babcock et al. (2002). The key difference is in the time and usage domains that the

Atom architecture and a data streaming architecture reside in. This difference is found

primarily in the Atom system’s feed consumer behaviour; the current configuration of

the prototype Atom implementation polls the Atom data feed at a regular, predefined

time interval to check for newly appended data elements. This behaviour is compara-

ble to a standard RSS feed aggregator such as NewsGator (http://www.newsgator.com) that routinely checks for

updates to the news feeds a user has subscribed to. However this behaviour is in stark

contrast to a system operating a data streaming architecture, in which the stream of

data may be monitored constantly, or in a more “on demand” method, such as the

network monitoring example mentioned in Section 2.6.

This key point of difference in many respects is only evident when a system using

the Atom-based architecture adopts the polling method mentioned above. If, for ex-

ample, the Atom feed was treated as a data stream [4], and the Atom-based system was

configured so that it reacted or was triggered as soon as the feed was updated, then the

architecture may in fact begin to appear more similar to the data streaming model. An

implementation of this particular configuration was attempted by the author, albeit

with the source being a relational database within a simulated operational production

environment. However, due to time constraints and technical issues regarding web

browser incompatibility between the development and test environments, it was not

possible to conduct thorough testing and data gathering.

3.6 Summary

This chapter has explained the means by which our Atom-based architecture was imple-

mented. Section 3.2 presented a series of use cases used to give guidance in development

of the prototype system. Features common to all the use cases were that the prototype

be lightweight, non-invasive and platform independent as discussed in Section 3.3.

Section 3.4 presented details specific to the actual development of the prototype.

Overviews of the development environment and implementation rationale were pro-

vided before a more in-depth discussion of key components of the architecture. The

perceived advantages of scalability, simplicity and low cost in relation to features as-

sociated with asynchronous connection schemes, Atom and PHP, highlight a potential

means to implement a lightweight data integration architecture.

Finally Section 3.5 compared the implementation of the Atom-based architecture

that we built to the Hermes-based system of Vargas et al. (2005). The following chapter

will describe the experimental design that was used to evaluate the Atom prototype

implementation.

[4] As it can meet the requirements of a data stream as defined by Babcock et al. (2002); Madden and Franklin (2002); Arasu et al. (2004); and Carney, Cetinternel, Cherniack, Convey, Lee, Siedman, Stonebraker, Tatbul and Zdonik (2002).


Chapter 4

Experimental Design

4.1 Introduction

This chapter will outline the design of the experiments that were used to evaluate the

Atom prototype’s potential to facilitate data integration by means of update propaga-

tion. Three different tests were designed to evaluate the system’s ability to perform

under various loading conditions, the demands it would place upon the networking

and computation resources supporting the architecture, and the response time of the

system under different configurations.

It was intended originally to test both the push and pull configurations of the

prototype implementation of the architecture. However due to technical issues result-

ing from differences in the development and testing environments, data sufficient for

analysis was only captured for the pull configuration.

The pull configuration test consisted of altering the frequency by which the feed

consumer would poll the Atom feed to observe the pull configuration’s behaviour at

different frequencies. Two frequencies were tested; additional frequencies were intended

to be tested but were not carried out due to time constraints on the availability of

the testing equipment. The first set of test runs were conducted with the Atom feed

consumer set to poll the feed at 15 second intervals, the second set of test runs increased

this interval to 30 seconds.

The expectation for the testing was that if the polling interval was short then the response time would also be low enough to accommodate the demands of an opera-

tional processing system. However, this expectation must be balanced in reality if the

system is to progress further than its current prototypical stage in that it also must not

encroach negatively on the network and processing resources that comprise the system


operating environment. In other words, if it is found that the system is able to provide

low latency times (the time between when an update originates and when it is applied

to the target) between updates due to a short polling interval, it may be at the cost

of too high a demand on networking or computing resources.

4.2 Experiment Rationale

The experiments were designed to be capable of generating the type of data needed

to conduct the Atom prototype evaluation. What was needed was a means to ob-

serve and record the prototype’s performance under conditions that could arise within

the use case scenarios. First we needed to define what was meant by performance;

in this context performance has been narrowly defined and further divided into two

classifications:

1. Responsiveness: Specifically this is the latency of the system, i.e., the time the

system takes to carry out a task like propagating an update from the data source

to the target.

2. Impact: This classification embraces features pertaining to the demands the sys-

tem places upon the resources available within the environment the system resides

in, i.e., the computer network.

In other words, any experiments to be used would need to be able to capture data

that would expose information regarding the system’s temporal responsiveness and the

impact the system places on its environment.

The element immediately identified to provide data about the system’s responsive-

ness was that of time; the requirement for any experiment of this type was to be able

to accurately record the time of events occurring within the system.

In terms of impact, the first task was to identify what could be recorded for mea-

suring the system’s impact on its environment. The obvious parameters to focus on

for this feature were the system’s outputs, i.e., the volume of data that the system was

generating and consuming.

However, in addition to performance, another parameter that had to be taken into

consideration was the requirements of the use case scenarios. The requirements provide

a means to compare the performance of the system relative to realistic scenarios.

This additional information enriches the raw performance data with greater mean-

ing. Therefore the experiments designed for this evaluation needed not only to be able


to record the performance of the Atom prototype architecture as previously defined,

but also needed to reflect the requirements set out in the use cases. It is important

to note that the principal use case scenario used to conduct the testing was the MP3

kiosk prototype described in section 3.2.2; this scenario was used because the Movie

timetable scenario was deemed too limited in scope to provide useful information, and

the data warehouse scenario was not available for testing. The experiments designed

in response to these requirements are presented next.

4.3 Evaluation Methodology

Response times differ significantly between systems involved in day-to-day operational

processing and analytical/data warehousing systems (Inmon, 1993; Atzeni et al., 1999;

Silberschatz et al., 2006). Inmon (1993) is one of the few to have actually put a

quantitative figure on this difference:

“Analytical response time is measured from 30 minutes to 24 hours. Re-

sponse times measured in this range for operational processing would be an

unmitigated disaster” (Inmon, 1993, preface p. x ).

This difference in response time requirements results from the kind of processing carried

out by these two distinct scenarios. An operational system deals with the day-to-day,

real-time demands of an organisation and as such requires low response times in order

for users to carry out their tasks; queries posed in this situation tend to be sporadic

and ad-hoc in nature with users needing quick access and retrieval of relevant data.

Conversely, a data warehouse often deals with large volumes of data in a manner akin

to batch processing, for example, performing queries for sales analysis at the end of a

company’s financial year. Thus the ability to process large volumes of data at often pre-

determined intervals is more of a factor in systems such as data warehouses compared

to operational processing.

This initial observation of the differences in response times between these two pro-

cessing paradigms forms the basis upon which the evaluation framework for testing the

Atom prototype has been built. The demands the prototype places on network, com-

putation resources and the system’s response time (latency) are cross-referenced with

Inmon’s (1993) observation, along with the requirements of the use cases presented in

Chapter 3, to infer whether the prototype has potential for further development in an

operational environment, an analytical environment, both or is simply not suitable in


its current state. The use of response time as an indicator of a system's performance

is a common approach, as mentioned in many of the approaches discussed by Nicola

and Jarke (2000) who surveyed performance modelling techniques for distributed and

replicated databases over the last 20 years.

The evaluation methodology consisted of three approaches in order to test the sys-

tem’s performance and efficiency. The goal of the methodology was to extract meaning-

ful data to infer the performance and scalability potential of the prototype architecture

relative to the observations outlined above, and to observe an implementation of the

architecture under various loading conditions and configurations. As the architecture

is intended for smaller-scale lightweight implementations the testing should serve to

elicit information that would support or reject this intention. The resources for the

testing environments (hardware) were chosen on the basis of what was available on a scale sufficient to support the testing we had designed and within the time frame that the testing had to be completed in. As a result the load test and operational tests were

carried out on very similar platforms, while the latency testing was run within a differ-

ent environment, as it was undertaken at a later date than the previous experiments.

The three experiments that were created to acquire the data needed to investigate this

problem will now be described.

4.3.1 The Load Test

Description

This test was designed to explore the performance capabilities of the architecture under

various loading conditions. Data captured from the test included:

• Atom feed output size (bytes).

• Source data size, i.e., the size in bytes of the SQL representation of the source

data.

• Feed consumer output (bytes).

• The elapsed time (latency) from when the update process was started until it

was finished.

The loading changes were represented by using different sets of sample data of varying

sizes. The data sets used for this experiment can be found in Appendix B.


Environment

The testing environment for the load test comprised five Apple Power Macintosh G5

computers with dual 1.8GHz CPUs and 1GB of RAM each. The computers were con-

nected via an isolated full duplex gigabit Ethernet switch. Installed on each computer

were the Apache 1.3 web server, PHP 5, MySQL 4, the Firefox web browser and Mac

OS X operating system version 10.4. Four of the computers were used as data sources

while the fifth was used as the target.

Procedure

The experiment procedure involved 4 sets of 15 test runs, with each set having another

data source added to the system. Thus, the first set of test runs used one data source;

the second used two sources and so on.

To initiate a new test run, the Atom feed consumer was first shut down if it was

still running. Copies of the consumer’s data capture files were then made before the

originals were emptied in preparation for the next test run. At this point attention

was shifted to the feed builder (or builders, depending on how many data sources were

being used). The generated Atom feed and the feed builders’ data capture files were

copied before the content of the files were deleted. Next, the target database and

source data staging tables were emptied, by using the standard SQL “TRUNCATE”

command.

With all the system elements now reset, the feed builders were restarted to regener-

ate the initially empty Atom feeds, after which as a precautionary measure, a network

ping from the machines housing the Atom feeds to the target was made to ensure there

was nothing wrong with the network. With Atom feeds now regenerated and the net-

work connection checked, the only task left to do was to start the feed consumer to

begin the test run.

4.3.2 The Operational Test

Description

This test was predominantly focused on observing the impact of the Atom prototype

on its host network. A query generator was built, also in PHP 5, that would apply a

set of update queries to the data source at randomly chosen time intervals between 0

and 10 seconds. This was an attempt to represent a realistic production environment

that the source data could likely reside in. As the principal data being collected in


this experiment focused on network performance, the network monitoring software

Ethereal (Ethereal, 2005) was used to collect all network activity that occurred while

the Atom prototype was running. Ethereal also has a comprehensive set of analysis

tools that enabled network performance data relating to packets, protocols and traffic

to be collected. Specific data captured from this test included:

• Packet information: the total number of packets that traversed the network dur-

ing a test run, and of that total, how many packets contained HTTP data. Pack-

ets containing HTTP data were important, as HTTP was the protocol used by the

Atom prototype to read the Atom feed.

• The total amount of data that traversed the network (bytes).
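Returning to the query generator described above, its behaviour can be pictured roughly as follows; the update statements, table names and connection details are hypothetical, not the actual test queries.

<?php
// Sketch of the operational-test query generator: updates fired at random intervals.
$pdo = new PDO('mysql:host=localhost;dbname=source', 'user', 'password');
$queries = array(
    "UPDATE album SET genre = 'blues' WHERE album_id = 1",
    "INSERT INTO album (album_title, artist_name) VALUES ('Blue Train', 'John Coltrane')",
    "DELETE FROM album WHERE album_id = 7",
);

foreach ($queries as $sql) {
    sleep(rand(0, 10));   // wait a randomly chosen interval between 0 and 10 seconds
    $pdo->exec($sql);     // apply the update to the source database
}
?>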

Environment

The operational test used the same hardware as the load test with the exception of the

network switch, which was replaced with a slower 100Base-T network hub. This change

was necessary in order for the Ethereal monitoring software to be able to collect data.

This difference is attributed to differences in how a network hub and switch operate, i.e.,

a switch establishes a direct point-to-point connection between two nodes, which means

that the Ethereal software cannot monitor the network traffic, unless it is running on

one of those nodes, but this could potentially compromise the test environment. A

network using a hub however results in network data being broadcast, which then

allows Ethereal to view what is happening.

Procedure

Several series of test runs were initiated with each series testing the Atom prototype

in a different configuration. The first stage was to populate the target and refresh the

Atom feeds; this was in essence the same procedure used in the load test. At this point

the feed builders and consumer were temporarily shut down, so the Ethereal network

monitoring software could be initialised to capture network traffic. The feed builders

were then restarted, followed by the feed consumer, and finally the query generators

(one for each data source) were initialised. Each test run lasted for 2 hours.

At the conclusion of each run, the data from that run was collected and the system

reconfigured and prepared for the next test. The reconfiguration consisted of changing

the frequency of either the consumer or the feed builder depending on the architecture in

use, i.e., the consumer polling frequency was altered when the testing involved a “pull”


configuration while the feed builder frequency was altered when a “push” architecture

was in use.

Monitoring the various parameters mentioned above provided a means to view the

demands the system placed on the infrastructure it was using, with particular emphasis

on the network resources consumed and time taken for the data transformation and

processing itself.

4.3.3 The Latency Test

Description

This test was designed to gather data about the responsiveness of the system, in terms

of the system’s update propagation latency when operating at different polling fre-

quencies. The query generator was employed once more, however the ad-hoc firing of

queries was disabled, resulting in queries being fired at uniform intervals. A single feed

builder and consumer were set up, as well as a third machine that acted as a time

keeper. As soon as an update was made to the source data a message was sent to

the time keeper and the time of the event was logged. The same action occurred at

the consumer end when the update was applied to the target thus providing a means

to calculate the elapsed time (latency) from when an initial update occurred on the

source to when that update was propagated to the target.

Environment

The latency testing environment was different because the earlier environment was no

longer available; it consisted of: one Dell Latitude CPi laptop with a Pentium 2 300

MHz processor and 128MB of RAM (used for the time server); one Dell Latitude CPx

laptop Pentium 3 500 MHz, 256MB of RAM which housed the target database and

Atom feed consumer; and one Dell Latitude X300 laptop with a Pentium M 1.2 GHz

processor and 1 GB of RAM, which housed the source database, Atom feed builder

and query generator. All the laptops were running the Windows XP operating system,

Apache 2.0.55 web server, PHP version 5.0.4 and the source and target databases were

run on a MySQL version 4.1.11 server. The machines were networked via a Lantronix

LMR8T-2 10Base-T hub.


Procedure

The time server was first initialised followed by the feed builder, the query generator

and finally the feed consumer. The query generator was set to execute 100 separate

updates on the source database. Each time an update was executed, an event notice

was sent to the time server along with the content of the update statement. The time

server would attach a timestamp when it received the event notification.

A similar process happened when the feed consumer applied an update from the

Atom feed to the target; as soon as the consumer had an update it would send an event

notification to the timeserver along with the contents of the update it had generated.

Thus every event that occurred in both the feed builder and consumer could be logged,

which meant it was possible to calculate the elapsed time from when an update was

initially applied to the source until it was finally applied to the target.
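The timing logic can be made concrete with a short sketch. The time-server URL, parameter names and log format below are illustrative assumptions and are not taken from the prototype's source code.

<?php
// Builder or consumer side: report an event (the update statement just handled)
// to the time server; the time server attaches the timestamp on receipt.
function notify_time_server($role, $statement) {
    $url = 'http://timeserver.example/log.php?role=' . urlencode($role)
         . '&stmt=' . urlencode($statement);
    return @file_get_contents($url);
}

// Analysis side: latency is the difference between the source-side and
// target-side timestamps logged for the same update statement.
function latency_seconds($sourceTime, $targetTime) {
    return strtotime($targetTime) - strtotime($sourceTime);
}

echo latency_seconds('2006-05-13 15:00:05', '2006-05-13 15:00:38'); // prints 33
?>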

This procedure was carried out twice: the first run had the feed consumer set to poll at 15-second intervals, while the second had the consumer polling at 30-second intervals. The end result was that elapsed times for 200 separate update events were captured for analysis.

4.4 Summary

The methodology for evaluating the Atom prototype has been presented. The evalua-

tion framework measured the prototype’s impact upon the network and computation

resources it consumed and compared these to observations of response time differences

between operational and analytical processing systems.

The testing observed the system to see whether it could operate in a stable yet responsive manner while placing relatively low demand on its supporting infrastructure. If the system can achieve this, then the prototype Atom-based architecture has the potential to be developed further for use in smaller scale commercial data integration solutions. If, however, the system places high to excessive demands on the resources that it requires, this could imply one of two things: either that the

architecture does not have the potential to facilitate a lightweight data integration so-

lution, or that it may in fact be more suitable to processing tasks that occur at slower

frequencies. The results from conducting this evaluation are presented in the next

chapter.


Chapter 5

Results

5.1 Introduction

The following results pertain to the pull configuration of the Atom prototype, as a sufficient amount of data suitable for analysis was not collected for the push configuration due to the technical issues mentioned earlier in Section 4.1. In addition, although 30 test runs were carried out for the operational test, only 18 were used for the actual analysis; some of the capture files were over a gigabyte in size, which may have caused problems when the data was written to DVD media. Section

5.2 outlines the results for each of the three tests conducted to provide data for eval-

uating the implementation of the Atom-based architecture. First the load test results

are presented in Section 5.2.1, followed by the operational test in Section 5.2.2 and

finally the latency test in Section 5.2.3. The chapter concludes with a summary of the

results in Section 5.3.

5.2 Findings

5.2.1 The Load Test

Figure 5.1 summarises the performance of the Atom prototype in terms of average

elapsed time under different loading conditions. The elapsed time was calculated by

adding the time taken by the feed builder modules to update the Atom feed to the time

taken by the feed consumer module to parse the Atom feed and apply the updates to

the target.

A total of 60 test runs were completed, comprising four sets with each set having a

different number of data sources connected. The first set of test runs were conducted


[Figure 5.1: Performance. Plot of the average time for update propagation, Time (Minutes) against Number of Sources.]

with one data source, and yielded an average elapsed processing time of 00:04:46,

ranging from 00:04:32 to 00:05:16. With two sources connected, the system completed

processing in 00:09:47 on average, ranging from 00:09:01 to 00:11:25 while this time

increased to 00:16:58 on average with three data sources and ranged from 00:16:04 to

00:18:53. The final set of test runs required the Atom prototype to process data from four different data sources; on average this task was completed within 00:23:02, with the shortest time of the set being 00:22:30 and the longest 00:23:27.

In addition to processing time performance, data were also recorded on the size of the Atom feed and the output SQL generated, in relation to the size of the SQL representation of the original source data, as shown in Figure 5.2. The sizes of the various outputs being measured were calculated using the PHP function filesize(), which returns the size in bytes of the file in question.
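As a simple illustration of how such a measurement can be taken (the file name and the byte-to-megabyte conversion are assumptions; the thesis does not state which convention was used):

<?php
// Illustration only: measure an output file and convert bytes to megabytes.
// 'atom_feed.xml' is a placeholder name; 1 MB is taken here as 1,048,576 bytes.
$bytes = filesize('atom_feed.xml');
printf("%.3f MB\n", $bytes / (1024 * 1024));
?>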

With one data source, the size of the source data SQL representation was 0.498

MB, the Atom feed measured 0.744 MB and the feed consumer output was 1.208 MB.

With two data sources the source SQL representation doubled to 0.996 MB, the Atom

feed was 1.488 MB and the consumer output 2.432 MB. The third set of test runs used

three data sources and increased the source SQL representation size to 1.494 MB, the

Atom feed size to 3.647 MB and the consumer output SQL to 2.232 MB. The final test

of four sources yielded a source SQL representation size of 1.992 MB, an Atom feed

size of 2.976 MB with consumer output SQL measuring 4.866 MB.


[Figure 5.2: Comparison of Outputs. Source SQL representation, Atom feed and SQL code size comparison; Output Size (megabytes) against Number of Sources, with series Initial Source SQL, Atom Feed and Consumer SQL Size.]

[Figure 5.3: Network Traffic Generated by Atom Prototype. Distribution of Test Runs against Bytes (millions) of Atom-generated network traffic, with series 15 Seconds and 30 Seconds.]


5.2.2 The Operational Test

As mentioned in Chapter 4, the operational test was designed to observe the prototype

system within the confines of a semi-realistic environment. The data captured from

this test included all network traffic that was generated immediately prior to, during

and immediately after the prototype was in operation. The number of data sources

that the system had to deal with was fixed at four. Figure 5.3 shows the distribution of

test runs in relation to the amount of network traffic generated. A total of 30 test runs were completed: 15 with the consumer polling frequency set at 15 seconds, and 15 with the consumer operating at a 30 second polling frequency. Only 18 were used for analysis; a portion of the network capture files were corrupted, and with time constraints placed on the test equipment there was no alternative other than to continue on to analysis with a smaller sample than was originally anticipated.

Results from the non-corrupted operational test data show that the average total

amount of data that traversed the network when the consumer was polling at 15 seconds

was 584 MB and ranged from 335.482 MB to 953.235 MB. The average amount of data

to traverse the network when the consumer was set to poll at 30 seconds was 380.322

MB, ranging from 201.334 MB to 683.729 MB.

In addition to the amount of data being generated and sent across the network,

data pertaining to the kinds of packets being sent and received were also collected. Of

particular significance were packets containing HTTP data, as this was the protocol

used by the prototype to send and receive data from the Atom feed. Figure 5.4 shows

the distribution of test runs relative to HTTP packet content. With the system set

to poll at 15 seconds, the average total number of packets containing HTTP data was 605,860 and ranged from 348,678 to 988,018, as seen in Figure 5.4. Conversely, with the feed consumer set to poll at 30 second intervals the average number of packets containing HTTP data was 394,458, ranging from 240,711 to 708,744. The average difference in total HTTP packets between the two configurations was 211,402.

5.2.3 The Latency Test

With the consumer set to poll the Atom feed at 15 second intervals, the mean elapsed

time for an update to be propagated to the target was 33 seconds with a variance of 22.3

seconds. However, with the consumer set to poll the Atom feed at 30 seconds, the mean

update propagation time increased to 66 seconds. Doubling the polling interval thus


[Figure 5.4: Packets Generated by Atom Prototype. Distribution of Test Runs against HTTP Packets (thousands), with series 15 Seconds and 30 Seconds.]

[Figure 5.5: Update latency. Update Query Events against Elapsed Time (Seconds), with series 15 Seconds and 30 Seconds.]


led to a proportional doubling of the mean update propagation latency; it would be interesting in future to conduct tests at additional

polling frequencies to see if this trend continues. Figure 5.5 illustrates a comparison

of the distribution of elapsed times between the two different configurations of the

system tested. The mean difference of update propagation times between the two

configurations was 33 seconds.

5.3 Summary

This chapter has presented the results of the three different experiments carried out to

evaluate the prototype implementation of the Atom based data integration architecture.

The results presented are all associated with the pull configuration of the prototype.

The load test results in Section 5.2.1 show that the average elapsed times for the

prototype system to completely propagate the contents of the data source ranged from

00:04:46 to 00:23:02 depending on how many data sources were connected to the system.

The operational test results in Section 5.2.2 present the impact the prototype system had on the network resources the testing was conducted on; however, the results from this particular test may not be as compelling as they could have been, as the amount of data captured was less than what was originally intended. Section 5.2.3 presented

results from the latency test which compared update latency between two different

feed consumer polling frequencies. A more in-depth discussion of these results will be

presented in the next chapter.


Chapter 6

Conclusion

6.1 Discussion of Results

Observations of response time differences between operational and analytical processing

systems, as identified by Inmon (1993) and Silberschatz et al. (2006) amongst others,

were used as a measure. The tests that were developed, described in Section 4.3,

recorded data pertaining to response time (latency) of the system, as well as the size

of various inputs and outputs of the system, such as the size of the Atom feed itself.

The results of the experiments were then presented in Chapter 5.

Results from the latency test (see Figure 5.5 on page 60) showed that the proto-

type system is capable of delivering response times low enough to fall within the range

deemed suitable for operational processing environments by Inmon (1993). Further-

more, results from the load testing (see Figure 5.1 on page 57) indicated that the system

in its current form is capable of accurately processing large data sets in a timely fash-

ion. The extent to which these results can be related to each other is limited slightly

as both sets of testing were performed on platforms that varied greatly in terms of

processing performance, as noted in Sections 4.3.3 and 4.3.1 respectively.

However, these results should still be looked at in contrast to what was found with

the operational testing (see Figures 5.3 and 5.4 on page 58). In one particular opera-

tional testing run, it was found that nearly one gigabyte of data eventually traversed

the network, with the system averaging between 380 MB and 584 MB depending on the polling frequency being used. Looking back at the load test results, the average size of the Atom representation for the entire contents of one data source was only 0.744 MB.

What can be drawn from these results is that although the prototype can perform

with a level of latency low enough to support an operational processing environment,


it does so at a cost of placing a substantial demand on the supporting environment’s

network resources, in relation to the size of the Atom feed that it generates.

An initial reaction to this observation is that the system would be more suitable to an

analytical processing environment, where responsiveness (low latency) is not so much

of an issue; rather, the ability to process large volumes of data accurately has greater

emphasis. In such an environment the polling frequency would be set much lower,

resulting in a much lower impact on network resources.

However, it would be premature to draw such a conclusion from these observa-

tions. We must remember that all the testing conducted for this evaluation has been

performed on a prototype system which, in this case, means that some functionality,

or the means by which the functionality has been implemented, is sub-optimal. For

example, on closer inspection of how the prototype actually works (see Section 3.4.2),

the feed consumer must download the entire Atom feed and parse it to infer whether

any new updates have been appended to the feed.
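A minimal sketch of this pull-style behaviour is given below. The feed URL, the applied-ID store and the commented-out apply step are placeholders rather than the prototype's actual code, but the sketch illustrates why every poll transfers and parses the whole feed.

<?php
// Sketch of one pull-style consumer poll (illustrative names throughout).
$feedUrl  = 'http://source.example/atom_feed.xml';
$seenFile = 'applied_ids.txt';
$seen     = file_exists($seenFile) ? file($seenFile, FILE_IGNORE_NEW_LINES) : array();

// The entire feed is downloaded and parsed on every poll, however few entries are new.
// This assumes the feed's elements are in its default (Atom) namespace.
$feed = simplexml_load_file($feedUrl);

foreach ($feed->entry as $entry) {
    $id = (string) $entry->id;
    if (in_array($id, $seen)) {
        continue;   // already applied: the entry was downloaded only to be discarded
    }
    // apply_update_to_target((string) $entry->content);   // hypothetical apply step
    file_put_contents($seenFile, $id . "\n", FILE_APPEND);
}
?>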

An immediate disadvantage of this approach can be seen: when the consumer is

polling the feed at a high frequency, it is essentially downloading the entire Atom feed

at a high frequency, which can result in the generation of a large amount of unnecessary

network traffic. The Atom feed is not overwritten; rather new data are appended to

the end of the feed (see Section 3.4.3), which adds to the initial issue by having an ever-

increasing amount of old or stale data traversing the network. The main advantage of

not overwriting the Atom feed is that it provides a simple mechanism for maintaining

a serialised copy of all the data generated so far.

The SSE specification briefly mentioned in Section 2.2.6 contains a method where

two feeds are created; one is used in a production context, called the partial feed, which

contains the latest updates. The other, named the complete feed, is more of an archival

service, and is useful for initial synchronisation when new feed consumers come online.

This has the advantage that the partial feed can be capped at a certain size to limit

network usage, while at the same time the complete feed can be used to keep data

serialised and be used to initialise new feed consumers or synchronise consumers who

have been offline for some time.
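A hedged sketch of how such a split could be maintained follows; the size cap, file names and entry layout are assumptions made for illustration and are not taken from the SSE specification or the prototype.

<?php
// Sketch: archive every entry to a complete feed, serve only the newest entries
// in the partial feed. The cap of 50 entries and all names are assumed values.
function write_feed($path, array $entryXml, $title) {
    $xml  = '<?xml version="1.0" encoding="utf-8"?>' . "\n";
    $xml .= '<feed xmlns="http://www.w3.org/2005/Atom">' . "\n";
    $xml .= '  <title>' . htmlspecialchars($title) . '</title>' . "\n";
    $xml .= implode("\n", $entryXml) . "\n</feed>\n";
    file_put_contents($path, $xml);
}

$maxPartialEntries = 50;                 // assumed cap on the production feed's size
$entries   = array();                    // in practice, reloaded from the complete feed
$entries[] = '  <entry><id>urn:example:update:1</id><title>update 1</title></entry>';

write_feed('complete.xml', $entries, 'Complete feed (archive)');
write_feed('partial.xml', array_slice($entries, -$maxPartialEntries), 'Partial feed (production)');
?>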

Another important issue, relating once more to the system’s prototype status, arises

from the way the mapping between the data source and target has been created; the

prototype made use of a predefined, hard-coded mapping, written specifically for the

schemas used in the experiments. In many ways this is the reason why the prototype

performed well in terms of accuracy, as the mapping was completely optimised for that


one particular case. This is a problem because before the system can be tested further,

or used with differing data sources and schemas, a new hard-coded mapping has to be

written and the prototype’s code base adapted to suit.

This situation may be bearable in simple situations where there are only a few

items to map and programmers are readily available to code the mapping. Realistically

however, this simply defeats the purpose of using the data integration architecture, as

there would be no real difference between using the Atom-based system and building

a solution from scratch. In other words, the prototype in its current form cannot be

generalised to other scenarios very easily.
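One direction for removing the hard-coded mapping would be to express it declaratively, so that only a configuration structure changes between scenarios. The sketch below borrows column names from the kiosk schemas in Appendix A, but the mapping structure and helper function are hypothetical rather than part of the prototype.

<?php
// Hypothetical declarative mapping from a source table to the target schema.
$mapping = array(
    'track' => array(
        'target_table' => 'track',
        'columns'      => array(
            'track_title'   => 'track_title',
            'track_length'  => 'track_length',
            'file_location' => 'file_location',
        ),
        'defaults'     => array('price_id' => 1, 'discontinued' => 'N'),
    ),
);

// Build a target INSERT from a source row using the mapping, rather than
// embedding the transformation directly in the consumer's code.
function map_insert($sourceTable, array $row, array $mapping) {
    $m    = $mapping[$sourceTable];
    $data = $m['defaults'];
    foreach ($m['columns'] as $src => $dst) {
        $data[$dst] = $row[$src];
    }
    $cols = implode(', ', array_keys($data));
    $vals = "'" . implode("', '", array_map('addslashes', $data)) . "'";
    return 'INSERT INTO ' . $m['target_table'] . " ($cols) VALUES ($vals);";
}

echo map_insert('track',
                array('track_title'   => 'Song A',
                      'track_length'  => '00:03:45',
                      'file_location' => '/mp3/1.mp3'),
                $mapping);
?>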

To summarise so far, interpreting the data at face value may lead us prematurely

to believe that the system would be best suited to an analytical processing context.

However, it is really far too early to consider such an idea when the underlying details

of the prototype's state of implementation are also examined. When the developmental status of the system is taken into account, we find that it initially requires three specific refinements:

1. Addressing the issue of network resource consumption at high polling frequencies.

2. Managing the size of the Atom feed itself; this is partly related to the previous issue.

3. Enabling the system to be generalised to other scenarios by improving the means

by which mappings between source and target schemas are specified. This could

be achieved by incorporating an appropriate mapping specification (like those

mentioned in Section 2.3 on page 14) into the architecture, as well as investigating

the use of an ontology specification to construct semantic mappings between

participating objects.

The prototype can really be considered to be at an “alpha” stage of development;

testing has revealed that it, and therefore an Atom-based architecture, is capable of

facilitating data integration between relational databases to an extent. However, to

continue down this path the implementation of a more refined version would need to

be undertaken, followed by additional testing to compare with the initial results of this

research.

When the results are considered alongside the state of the prototype, the system is really at a point where a decision needs to be made about whether or not to continue further development work.


Considering the amount of work currently going into the already large body of

data integration research (as discussed in Section 2.3), and that there are already com-

mercially available products such as Altova’s MapForce (Altova, 2005), the immediate

decision could be not to continue further development.

However, with the recent announcements of the SSE and GData specifications (see

Section 2.2.6), there are signs that there is a growing community investigating the

extended usage of content syndication technologies. Therefore, rather than shelving

the project because there is already much work, both academic and commercial, going

on in the field of data integration, an alternative option is to open further development

up to the open source community.

This option would have the advantages of exposing the prototype to others investigating extended uses of syndication technologies, gaining access to a large talent pool of developers and, in general, presenting the prototype and the ideas behind it to further scrutiny and debate from a large, diverse audience.

6.2 Summary

We have evaluated Atom for its potential as a lightweight architecture to support data

integration. Data integration addresses the problem of presenting a user with a unified view of data sets that may not only be located in different places but also be structured in a variety of different ways.

A series of topics from related work were presented in Chapter 2 to illustrate and

discuss where both this research and the Atom specification are positioned. Atom is

a content syndication specification, and has been developed in response to perceived

issues regarding RSS, especially to address the proliferation of differing versions of

the RSS specification. Furthermore, it provides an avenue for further development

in light of the fact that RSS 2.0 is copyrighted by Harvard University and considered frozen; that is, further development of that particular branch of the specification will

not continue.

The publish/subscribe paradigm was discussed next to show an environment in

which Atom is used. Additionally, a pertinent piece of research that used the Hermes

publish/subscribe infrastructure as a novel platform for data integration was discussed,

in part to illustrate the domain this research wanted to investigate the Atom specifi-

cation within. A feature that both the Atom prototype we developed and the Hermes-

based system of Vargas et al. (2005) shared is that they facilitate data integration by


means of asynchronous update propagation, which refers to the problem of updating

copies of an object and is commonly associated with distributed systems. However,

how the Atom-based and the Hermes-based systems do this is quite different.

Data streaming contrasts with the asynchronous behaviour of the Atom-based pro-

totype that we developed. Data streaming is a significant area of research in its own right, and information systems increasingly have to be able to process these kinds of highly dynamic and transient data.

Chapter 3 then presented an implementation of the Atom-based architecture that

we developed for this research. The prototype has been developed in PHP, in part

due to our familiarity with the technology, and also because this enabled a prototype to be developed that could be ported easily to different platforms. This was of initial pragmatic importance as the development and testing environments were located on different platforms. Furthermore, PHP is freely available, which means that it is

a suitable candidate to base an affordable production grade version of the system on.

We combined the freely available, feature-rich technology PHP with the simplified asynchronous connection scheme that content syndication technology offers to create our

data integration prototype.

Two particular configuration types of the architecture were presented, namely

“push” and “pull”. Within the push method, the consumption of feed information

is governed predominantly by changes in state of the source data, i.e., when the feed

generator detects a change in state of the source data the feed is regenerated and the

consumer module is called immediately to apply the new information to the target

schema. The pull method differs from the push approach on two key points. First,

the feed consumer modules operate independently of, and are therefore not directly

influenced by, the feed generator component. Secondly, the flow of feed information to

the target schema is governed by the consumer module itself.
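The distinction can be summarised in a few lines; all function bodies below are placeholders rather than the prototype's actual modules, and the bounded loop is purely for illustration.

<?php
// Placeholder consumer step: fetch the Atom feed, parse it and apply new entries
// to the target (see Section 3.4.2 for the prototype's actual behaviour).
function consume_feed_and_apply_updates() {
}

// Push: the feed builder detects a change in the source, regenerates the feed
// and immediately invokes the consumer (this function would be called by the builder).
function on_source_change() {
    // regenerate_feed();
    consume_feed_and_apply_updates();
}

// Pull: the consumer runs on its own timer, independently of the builder.
$pollSeconds = 30;                  // 15 s and 30 s were the intervals tested
for ($i = 0; $i < 3; $i++) {        // bounded purely for illustration
    consume_feed_and_apply_updates();
    sleep($pollSeconds);
}
?>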

Three use cases were presented (see Section 3.2 on page 32) that were used to

give guidance in development of the Atom prototype. Each use case had a degree of

complexity and scale slightly greater than that of its predecessor; however, all the use cases share some common requirements. The first of these requirements was that the

architecture has a non-intrusive nature; that is, the architecture should act strictly as a

mediatory framework between the sources and target. Furthermore, the implemented

architecture should be lightweight in terms of the network and computational resources

it consumes and it should also be platform independent.

Chapter 3 discussed the key components of the Atom-based architecture, namely


the feed generator and feed consumer, before a comparison of the Atom prototype to the

publish/subscribe data integration architecture of Vargas et al. (2005) was presented

in Section 3.5.

Chapter 4 described the experimental design used to evaluate the prototype. The

evaluation framework measured the prototype’s impact upon network and computation

resources when operating in specific use cases presented earlier in Chapter 3, specifically

the MP3 kiosk case.

Chapter 5 presented the results of the three experiments used to evaluate the im-

plemented Atom based data integration architecture. The results presented are all

associated with the pull configuration of the prototype.

6.3 Recommendations and Conclusions

Some initial recommendations of features that should be provided by any further de-

velopment of the Atom-based architecture were identified earlier in this chapter, prin-

cipally enabling the system to be generalised to scenarios outside those found within

this research. Another interesting feature would be to investigate the ability to prop-

agate schemas as well as data. This would provide a means to deploy data sets on platforms other than the one the schema was originally created on, without needing to

first manually specify an equivalence mapping. For example, a vendor whose develop-

ment environment is different to that of its client could still develop a data structure

and then use the Atom system to transfer and transform that structure to the client

ready for deployment. Another useful scenario would be when initially establishing a

mapping between a data source and a target; the source’s schema specification could

be sent and the Atom consumer could then derive a mapping solution to the target for

verification by the user.
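As an illustration of the schema propagation idea, a source's DDL could be carried in an ordinary Atom entry and flagged with a category so that the consumer treats it differently from data entries. All element values below are illustrative assumptions.

<?php
// Sketch: wrap a CREATE TABLE statement (taken from the source schema in
// Appendix A) in an Atom entry. The id, category term and formatting are assumed.
$ddl = "CREATE TABLE artist (\n"
     . "  artist_id int(10) NOT NULL auto_increment,\n"
     . "  artist_name varchar(50) NOT NULL default '',\n"
     . "  PRIMARY KEY (artist_id)\n"
     . ");";

$entry = "<entry>\n"
       . "  <id>urn:example:schema:artist</id>\n"
       . "  <title>Schema definition: artist</title>\n"
       . "  <updated>" . gmdate('Y-m-d\TH:i:s\Z') . "</updated>\n"
       . "  <category term=\"schema\"/>\n"
       . "  <content type=\"text\">" . htmlspecialchars($ddl) . "</content>\n"
       . "</entry>\n";

echo $entry;
?>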

It would also be worth looking at the SSE and GData APIs as a basis for

the development of a future prototype. Both technologies are backed by organisations

with substantial resources, and already have resources for the developer community set

in place. Furthermore, it may be an opportune time to survey the extent of research

regarding the emerging use of content syndication technology, like Atom, outside of its

conventional context.

In conclusion, the architecture presented in this thesis has potential in facilitating a

lightweight data integration solution and can exhibit a non-intrusive behaviour toward

the data objects interacting through it. Our research has shown that an Atom-based


architecture is capable of operating within a range of conditions and environments and

with further development, would be capable of greater processing efficiency and wider

compatibility with other types of data structures.


References

Adali, S., Candan, K. S., Papakonstantinou, Y. and Subrahmanian, V. S. (1996).

Query caching and optimization in distributed mediator systems, The 1996 ACM

SIGMOD International Conference on Management of Data, ACM Press, New

York, NY, USA, pp. 137–146.

Altova (2005). Altova MapForce database mapping. http://www.altova.com/products/mapforce/xml_to_db_database_mapping.html, accessed 7 October 2005.

Arasu, A., Babu, S. and Widom, J. (2004). CQL: A Language for Continuous

Queries over Streams and Relations, Lecture Notes in Computer Science, 2921

edn, Springer.

AtomEnabled (2005). AtomEnabled. http://www.atomenabled.org, accessed 9

February 2005.

Atzeni, P., Ceri, S., Paraboschi, S. and Torlone, R. (1999). Database Systems: Con-

cepts, Languages & Architectures, McGraw-Hill, London.

Babcock, B., Babu, S., Mayur, D., Motwani, R. and Widom, J. (2002). Models and

issues in data stream systems, ACM Principles Of Database Systems (PODS),

ACM Press, Madison, Wisconsin, USA, pp. 1–16.

Baldoni, R., Contenti, M. and Virgillito, A. (2003). The evolution of publish/subscribe

communication systems, Future Directions of Distributed Computing, Vol. 2584,

Springer Verlag.

Batini, C., Lenzerini, M. and Navathe, S. B. (1986). A comparative analysis of method-

ologies for database schema integration, ACM Computing Surveys 18(4): 323–364.

Beck, R., Weitzal, T. and Konig, W. (2002). Promises and pitfalls of sme integration,

The 15th Bled Electronic Commerce Conference, Bled, Slovenia, pp. 567–583.


Berners-Lee, T., Connolly, D. and Swick, R. R. (1999). Web architecture: Describing

and exchanging data. http://www.w3.org/1999/04/WebData.

Berners-Lee, T. and Fischetti, M. (1999). Weaving the Web, Orion Business, London.

Berners-Lee, T., Hendler, J. and Lassila, O. (2001). The Semantic Web, Scientific American. http://www.scientificamerican.com/2001/0501issue/0501berners-lee.html.

Boyd, M., Kittivoravitkul, S., Lazantis, C., McBrien, P. and Rizopoulos, N. (2004).

AutoMed: A BAV data integration system for heterogeneous data sources, Lecture

Notes in Computer Science, Springer-Verlag, pp. 82–97.

Breitbart, Y., Komondoor, R., Rastogi, R., Seshadri, S. and Silberschatz, A. (1999).

Update propagation protocols for replicated databases, Proceedings of the 1999

ACM SIGMOD International Conference on Management of Data, ACM Press,

Philadelphia, Pennsylvania, United States, pp. 97–108.

Buretta, M. (1997). Data Replication Tools and Techniques for Managing Distributed

Information, John Wiley & Sons, New York.

Cali, A., Calvanese, D., De Giacomo, G. and Lenzerini, M. (2002). Data Integration

under Integrity Constraints, number 2348 in Lecture Notes in Computer Science,

Springer.

Cali, A., Calvanese, D., Giacomo, G. D. and Lenzerini, M. (2002). On the Expressive

Power of Data Integration Systems, number 2503 in Lecture Notes in Computer

Science, Springer.

Calvanese, D., Giacomo, G. D., Lenzerini, M., Nardi, D. and Rosati, R. (1998). Infor-

mation integration: Conceptual modelling and reasoning support, The 3rd IFCIS

International Conference on Cooperative Information Systems (CoopIS’98), IEEE

Computer Society Press, New York, NY, pp. 280–291.

Campailla, A., Chaki, S., Clarke, E., Jha, S. and Helmut, V. (2001). Efficient filtering in

publish-subscribe systems using binary decision diagrams, The 23rd International

Conference on Software Engineering (ICSE’01), IEEE Computer Society, Toronto,

Canada, pp. 04–43.


Carney, D., Cetinternel, M., Cherniack, C., Convey, C., Lee, S., Siedman, G., Stone-

braker, M., Tatbul, N. and Zdonik, S. (2002). Monitoring streams - a new class

of data management applications, The 28th Very Large Databases (VLDB) Con-

ference, Hong Kong, China, pp. 215–226.

Chawathe, S., Garcia-Molina, H., Hammer, J., Ireland, K., Papakonstantinou, Y.,

Ullman, J. and Widom, J. (1994). The TSIMMIS project: Integration of hetero-

geneous information sources, The Information Processing Society of Japan (IPSJ)

Conference 1994, Tokyo, Japan.

Connolly, T. M. and Begg, C. E. (2005). Database Systems: A Practical Approach to

Design, Implementation, and Management, Addison-Wesley, Essex, UK.

Date, C. J. (2004). An Introduction to Database Systems, eighth edn, Addison-Wesley,

New York.

Duschka, O. M., Genesereth, M. R. and Levy, A. Y. (2000). Recursive query plans for

data integration, The Journal of Logic Programming 43(1): 49–73.

Ethereal (2005). Ethereal: A network protocol analyser. http://www.ethereal.com,

accessed 7 October 2005.

Eugster, P. T., Felber, P. A., Guerraoui, R. and Kermarrec, A. (2003). The many faces

of publish/subscribe, ACM Computing Surveys 35(2): 114–131.

Farooq, U., Parsons, E. W. and Majumdar, S. (2004). Performance of publish/subscribe

middleware in mobile wireless networks, WOSP ’04: Proceedings of the 4th In-

ternational Workshop on Software and Performance, ACM Press, New York, NY,

USA, pp. 278–289.

Fensel, D., Hendler, J., Lieberman, H. and Wahlster, W. (eds) (2003). Spinning the

Semantic Web, MIT Press, Cambridge, MA.

Friedman, M., Levy, A. and Millstein, T. (1999). Navigational plans for data in-

tegration, 16th National Conference on Artificial Intelligence (AAAI’99), AAAI

Press/The MIT Press, pp. 67–73.

Ge, Z., Ji, P., Kurose, J. and Towsley, D. (2003). Matchmaker: Signalling for dynamic

publish/subscribe applications, The 11th IEEE International Conference on Net-

work Protocols (ICNP’03), IEEE Computer Society, Los Alamitos, CA, p. 222.


Goh, C. H., Bressan, S., Madnick, S. and Siegel, M. (1999). Context interchange:

new features and formalisms for the intelligent integration of information, ACM

Transactions on Information Systems 17(3): 270–293.

Golab, L. and Ozsu, M. T. (2003a). Issues in data stream management, SIGMOD

Record 32(2): 5–14.

Golab, L. and Ozsu, M. T. (2003b). Processing sliding window multi-joins in continuous

queries, The 2003 International Conference on Very Large Databases, Morgan

Kaufmann, pp. 500–511.

Google (2006). Google Data APIs Overview. http://code.google.com/apis/gdata/

overview.html, accessed 24 May 2006.

Gray, J., Homan, P., Korth, H. F. and Obermarck, R. (1981). A strawman analysis of

the probability of wait and deadlock, IBM Technical Report RJ3066 .

Gupta, A., Sahin, O. D., Agrawal, D. and El Abbadi, A. (2004). Meghdoot: Content-based

publish/subscribe over P2P networks, in H. A. Jacobsen (ed.), Middleware 2004,

International Federation of Information Processing (IFIP), pp. 254–273.

Haas, L. M., Miller, R. J., Niswonger, B., Roth, M. T., Schwarz, P. M. and Wimmers,

E. L. (1999). Transforming heterogeneous data with database middleware: Beyond

integration, IEEE Data Engineering Bulletin 22(1): 31–36.

IDEAlliance (2006). About PRISM. http://www.prismstandard.org/about/, ac-

cessed 05 November 2006.

Inmon, W. H. (1993). Building the Data Warehouse, John Wiley & Sons, New York.

Koivunen, M. and Miller, E. (2001). W3C Semantic Web activity, Semantic Web

Kick-Off in Finland: Vision, Technologies, Research and Applications, HIIT Pub-

lications, Helsinki, Finland, pp. 27–43.

Lenzerini, M. (2002). Data integration: A theoretical perspective, ACM Principles Of

Database Systems (PODS), ACM, Madison, Wisconsin, USA, pp. 233–246.

Levy, A. Y. (2000). Logic-based techniques in data integration, in J. Minker (ed.),

Logic Based Artificial Intelligence, Kluwer Academic, Dordrecht, pp. 575–595.


Madden, S. and Franklin, M. J. (2002). Fjording the stream: An architecture for

queries over streaming sensor data, The 18th International Conference on Data

Engineering (ICDE’02), IEEE, p. 0555.

Madhavan, J. and Halevy, A. Y. (2003). Composing mappings among data sources,

The 29th Very Large Databases (VLDB) Conference, Berlin, pp. 572–583.

Manola, F., Miller, E. and McBride, B. (2004). RDF Primer. W3C Recommendation.

http://www.w3.org/TR/rdf-primer/.

McBrien, P. and Poulovassilis, A. (2003). Data integration by bi-directional schema

transformation rules, The 19th International Conference on Data Engineering

(ICDE’03), IEEE, pp. 227–238.

McGuinness, D. L. and van Harmelen, F. (2004). OWL Web Ontology Language.

http://www.w3.org/TR/owl-features/.

Nicola, M. and Jarke, M. (2000). Performance modelling of distributed and replicated

databases, IEEE Transactions on Knowledge and Data Engineering 12(4): 645–

672.

Nottingham, M. and Sayre, R. (2005). The Atom Syndication Format. http://tools.ietf.org/html/rfc4287.

O’Neil, P. and O’Neil, E. (2001). Database: Principles, Programming, Performance,

Morgan Kaufmann, San Francisco, CA.

Ozsu, M. T. and Valduriez, P. (1999). Principles of Distributed Databases, 2nd edn,

Prentice Hall, New Jersey.

Ozzie, J., Moromisato, G. and Suthar, P. (2005). XML developer center: Simple

sharing extensions for RSS and OPML. http://msdn.microsoft.com/xml/rss/sse, accessed 24 May 2006.

Pascoe, R. T. and Penny, J. P. (1990). Construction of interfaces for the exchange

of geographic data, International Journal of Geographical Information Systems

4(2): 147–156.

Powers, S. (2003). Practical RDF, O’Reilly, Sebastopol, CA.

Progress Software (2006). Progress Apama algorithmic trading platform. http://www.

progress.com/realtime/products/apama/index.ssp, accessed 10 April 2006.


Silberschatz, A., Korth, H. F. and Sudarshan, S. (2006). Database System Concepts,

fifth edn, McGraw-Hill, New York.

Tomasic, A., Raschid, L. and Valduriez, P. (1998). Scaling access to heterogeneous data

sources with DISCO, IEEE Transactions on Knowledge and Data Engineering

pp. 808–823.

Ullman, J. D. (1997). Information integration using logical views, Database Theory -

ICDT ’97. 6th International Conference Proceedings pp. 19–40.

Vargas, L., Bacon, J. and Moody, K. (2005). Integrating databases with pub-

lish/subscribe, The 25th International Conference on Distributed Computing Sys-

tems Workshops (ICDCSW’05), IEEE Computer Society, pp. 392–397.

Vivometrics (2005). Vivometrics technology backgrounder. http://www.vivometrics.com/site/pdfs/find.php?file=VivoMetrics_TechnologyBackground, accessed 15 April 2006.

Vivometrics (2006). Advanced real-time monitoring ensemble for first respon-

ders deployed by U.S. military. http://www.vivometrics.com/site/press_pr20060411.html, accessed 15 April 2006.

Wang, J., Jin, B. and Li, J. (2004). An Ontology-Based Publish/Subscribe System,

number 3231 in Lecture Notes in Computer Science, Springer.

Widom, J. (1995). Research problems in data warehousing, CIKM ’95: Proceedings of

the fourth international conference on Information and knowledge management,

ACM, pp. 25–30.

Wiederhold, G. (1993). Intelligent information integration, Proceedings of the 1993

ACM SIGMOD International Conference on Management of Data (SIGMOD ’93),

ACM Press, New York, NY, pp. 434–437.

Wiederhold, G. (1995). Mediation in information systems, ACM Computing Surveys

27(2): 265–267.

Xu, L. (2001). Efficient and scalable on-demand data streaming using UEP codes,

Proceedings of the Ninth ACM International Conference on Multimedia, ACM

Press, New York, NY, pp. 70–78.


Yu, C. and Popa, L. (2004). Constraint based XML query rewriting for data integration,

Proceedings of the 2004 ACM SIGMOD International Conference on Management

of Data (SIGMOD ’04), ACM Press, New York, NY, pp. 371–382.


Appendix A

Music Kiosk Use Case

A.1 MP3 Kiosk Project Documentation


Digital Music Kiosk

Development Team:

Stefan Bryce

Prajesh Chhannabhai

David Williamson


Executive Summary

Since the rise in popularity of digital audio, the music industry has sought to make music

available in a manner that is convenient for the consumer in order to continue to obtain

revenue from copyrighted works. Despite several iterations, there appears to have been no serious progress that meets the needs of the mobile consumer.

The latest positive progress has been the advent of paid music content delivered across the

Internet. However, the delivery of music to a person’s home is only a partial solution, and

assumes access to a PC with reasonable Internet access – not a given in many parts of the

world. Instead, high speed easily accessible avenues of delivery are required.

This document outlines a possible solution, taking advantage of available hardware and the creation of leading-edge interfaces, to provide what the consumer wants – high-speed

delivery of digital media.


1 Introduction ... 4
2 Background ... 4
2.1 Justification of System ... 4
2.2 Existing Product Evaluation ... 5
KIS Company Details ... 5
OverDrive Company Details ... 5
3 Problem Identification ... 9
3.1 Identified problems and proposed solutions ... 9
3.2 Future Work ... 9
4 Scope and Objectives ... 10
4.1 Scope ... 10
4.2 Objectives ... 10
5 Proposed Solution ... 11
5.1 System overview ... 11


1 Introduction

This project proposes the development of a digital music retail kiosk.

A growing trend in the adoption of portable MP3 players and a slowing of CD purchases from traditional music retail channels has emerged1. Many vendors have tried to implement a solution to complement the growing popularity of the MP3 file format, but have not been as effective as originally envisaged. This project attempts to alleviate the shortfalls identified in previous efforts by producing a prototype that may be developed into a commercially viable product.

The purpose of this document is to:
• Outline the justification of the proposed solution
• Identify the validity of our solution
• Confirm the scope and objectives for the project

2 Background

2.1 Justification of System

Due to a trend of consumers moving towards digital audio, rather than traditional, a new niche has been created in the e-commerce industry to satisfy consumer demands. Some key players in the music industry, for example Sony, have attempted to implement their own solutions to fill this void.

We propose a solution that addresses past failures and attempts to cater towards modern consumer needs. Our proposed solution will allow consumers the opportunity to customize their music purchase by means of creating custom music compilations. To make the systems more flexible, the kiosk will have provision for a wide variety of modern high speed storage devices.

It is intended that it will connect to multiple music databases, although this is out of the scope of this prototype.

It is anticipated that consumers will respond positively to a system that contains a wide variety of music and operates with speed. This will also offer music distributors a legitimate and easy way to distribute legal digital audio.


2.2 Existing Product Evaluation

KIS Company Details

Name: Kiosk Information Systems, Inc.
URL: http://www.kis-kiosk.com
Products/Solutions: KIS 770 Kiosk; KIS 780 SNAPTRAX Kiosk
Technologies: Based around Dell Optiplex computer
Critique:

The KIS kiosk products provide a wide range of content and services to the client. Deliverables consist of downloadable digital content for mobile phones, digital photo printing and photo CD creation, internet browsing services and digital music retail. The digital music component allows the user to create unique compilations and burn them to CD's or download to a laptop or MP3 player. Although these kiosks provide users with the ability to create music compilations, the range of music available to be purchased is somewhat limited.

OverDrive Company Details

Name: OverDrive, Inc.
URL: http://www.overdrive.com
Products/Services: Digital Rights Management (DRM)
Technologies: Microsoft Digital Asset Server; Windows Media; Adobe Content Server
Critique:
Although OverDrive's DRM solution does provide a high level of customization at the design level, it does have two key factors limiting its usefulness as a platform for kiosk based music retailing. Firstly, the digital audio format supported is limited to Windows Media (.WMA) files only. Secondly, the DRM solution is a generalized specification incorporating a patchwork of technologies. This is in part due to the solution's intended use as the building blocks for myriad digital content distribution applications. However, such architecture is not always suitable for a specialized digital music retail kiosk where responsiveness and flexibility are critical design considerations.


MMS Company Details
Name: RedDotNet's Multimedia Merchandising System
URL: http://www.reddotnet.com/
Products/Solutions: Multimedia Merchandising System
Technologies: Runs on a Windows NT/2000 platform; remotely controlled and updated
Critique:
This system is only a music preview system, offering no method of burning or downloading the audio; it is used only as a listening and searching post. The system must be located inside a music retail shop, and only carries a database of the audio that the shop has available for purchase.

Soundbuzz Company Details
Name: Soundbuzz
URL: http://www.soundbuzz.com
Products/Solutions: "CDBank" Kiosk
Technologies: Customised CD production; Digital Rights Management; touch screen interface

Critique:

The Soundbuzz CDBank kiosks are interactive, freestanding machines that allow a user to burn a custom or single track CD (CD-audio format) of their own choice from a pre-set track list. They also allow the user to take a digital picture and write a personal message that is printed on the CD itself. The machine takes regular cash/coins and cash cards and due to this - it has to be a secure, heavily locked-up machine. Using Microsoft's Windows Media Rights Management technology, music is protected as it is distributed around the Internet. Soundbuzz's in-house DRM solution, developed jointly with Microsoft, ensures media rights are protected and cleared and Soundbuzz's multi-currency payment cart allows users, even those with local currency credit cards, to make purchases online.


Charge Me Company Details
Name: Charge Me
URL: http://www.charge-me.co.uk/
Products/Solutions: Charge Me MK1
Technologies: Pentium-based PC system
Critique:
The MK1 kiosk focuses on the delivery of mobile content, i.e. loading of logos, ring tones, images, topping up etc. It does make provision for downloading of music. Currently it allows for mobile transfer as well as memory card download channels, but plans to expand to Bluetooth as well. The kiosk has some of the hardware solutions, but as its target area is the mobile phone market, it has a limited range on the music front. It does not allow for custom made compilations, or copying to media such as MP3 devices, or even CDs. It is a good solution for mobile technologies, but doesn't cover the music side of things in much depth.

Syncor Systems Company Details
Name: Syncor Systems
URL: http://www.syncorsystems.com
Products/Solutions: Swat Team Flexible Interactive Kiosk
Critique:
The kiosk provides a web-enabled, cross-platform, database-driven tool to search for music. It allows searches by number, brand and artist. The system is built with a graphical, multimedia, touch screen interface. However, its shortfall is that it is basically an electronic listening and search post, hence it does not allow one to download or copy music to any form of media.


Synergy Media Group Company Details
Name: Synergy Media Group
URL: http://www.touchstand.com
Products/Solutions: TouchStand Media Kiosk
Technologies: Wireless technology; bar code scanning; listening station; touch screen interface; Apple computer platform running OS X
Critique:
Developed by Synergy Media Group, the TouchStand Kiosk is a web-enabled, in-store media kiosk that offers retailers and their customers digital audio clips from more than 3.2 million songs, retailer-defined top seller lists, in-depth content searches, consumer data mining, e-mail and mailing list management, labour-free point of sale, and automatic content updating in one integrated package that is branded with the retailer's marketing. Its music database comes from Muze Inc. The TouchStand Media Kiosk runs on an Apple eMac computer with the OS X operating system and a full colour 17" touch screen. TouchStand's wireless network connection to the internet is secured by SonicWALL's encryption technology. Problems identified with this kiosk are that it does not yet offer any copying of music; it currently only serves as an electronic catalogue for retailers, hence there is no ability for it to copy to any media as yet. Also, there is no ability for customers to create a customized search, as it only looks for whole existing albums (i.e. not individual tracks).


3 Problem Identification

3.1 Identified problems and proposed solutions:

Transfer Rate

Transfer rate was found to cause problems in previous projects due to the waiting time for the finished product. Our solution is to use more advanced hardware than was previously available, together with modern data transfer methods.

Hardware Speed

As with the transfer rate, the processing speed of the system sometimes hindered performance. The use of the most modern hardware technology will help to increase the responsiveness of the system.

Lack of Medium options

Previous systems were using only CD-ROM as the principle form of output. This restriction gave the consumer no choice in how they received their content and made the system susceptible to frequent maintenance, for example, reloading the machine with blank media.

Our solution addresses these issues by offering the latest in data transfer technology thus giving the consumer more flexibility in how they acquire their media.

3.2 Future Work

Variety of music

A lack of a wide variety of music catering to different consumer demographics restricted sales and usage of previous implementations.

Our solution to this problem will be to make the application adaptable to accept data from multiple distributed sources, therefore providing a more diverse and rich pool of content from which consumers can choose.

Cash Security

Cash handling has made previous implementations cumbersome and vulnerable to threats such as theft and damage resulting in increased chances of downtime. The physical requirements of storing the cash within the kiosk further restricted the placement options available. It also incurred further overhead with the need for the cash to be collected on a regular basis.


4 Scope and Objectives

4.1 Scope

The system prototype will include:
• Interactive Graphical User Interface.
• Diverse range of hardware inputs/outputs.

Future work includes:
• The ability to connect to multiple databases.
• Payment options like EFTPOS and Credit Card.

4.2 Objectives

The proposed system will offer:
• Low maintenance system design.
• Provision for efficient use through a well defined Graphical User Interface.
• Responsiveness through the use of high speed data transfer methods.
• Ease of scalability for the system, both hardware and software.
• A secure transaction environment for both users and the owners.


5 Proposed Solution

5.1 System overview

Inputs
User Specific
• Search Function
• Browsing capability
System Specific
• EFTPOS
• Network

Functionality
• GUI
• Database

Outputs
System Outputs
• User Connectivity
• USB 2.0, Firewire, Wireless, Memory Cards
• CD (only if administered by local staff)
Audio
• Sample
• Purchased Media
• Digital file


A.2 Schemas

A.2.1 Kiosk Schema

-- phpMyAdmin SQL Dump

-- version 2.6.2

-- http://www.phpmyadmin.net

--

-- Host: localhost

-- Generation Time: Apr 14, 2006 at 04:10 PM

-- Server version: 4.1.11

-- PHP Version: 5.0.4

--

-- Database: ‘kiosk_target‘

--

-- --------------------------------------------------------

--

-- Table structure for table ‘album‘

--

CREATE TABLE album (

album_id int(10) NOT NULL auto_increment,

artist_id int(10) NOT NULL default ’0’,

genre_id int(10) NOT NULL default ’0’,

release_date date NOT NULL default ’0000-00-00’,

total_songs int(10) NOT NULL default ’0’,

price_id int(10) NOT NULL default ’3’,

cover_path varchar(200) NOT NULL default ’’,

album_title varchar(50) NOT NULL default ’’,

album_size decimal(3,2) NOT NULL default ’0.00’,

discontinued enum(’Y’,’N’) NOT NULL default ’N’,

PRIMARY KEY (album_id)

) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;

-- --------------------------------------------------------

--

-- Table structure for table ‘artist‘

--

CREATE TABLE artist (

artist_id int(10) NOT NULL auto_increment,

artist_name varchar(50) NOT NULL default ’’,

PRIMARY KEY (artist_id)

) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;

-- --------------------------------------------------------

--

-- Table structure for table ‘genre‘

--

CREATE TABLE genre (

genre_id int(10) NOT NULL auto_increment,

genre_name varchar(50) NOT NULL default ’’,

PRIMARY KEY (genre_id)

) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;

-- --------------------------------------------------------

--

-- Table structure for table ‘price‘

--

CREATE TABLE price (

price_id int(10) NOT NULL auto_increment,

price decimal(3,2) NOT NULL default ’0.00’,

PRIMARY KEY (price_id)


) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;

-- --------------------------------------------------------

--

-- Table structure for table ‘track‘

--

CREATE TABLE track (

track_id int(10) NOT NULL auto_increment,

track_title varchar(50) NOT NULL default ’’,

artist_id int(10) NOT NULL default ’0’,

album_id int(10) NOT NULL default ’0’,

track_no int(2) NOT NULL default ’0’,

track_length time NOT NULL default ’00:00:00’,

price_id int(10) NOT NULL default ’1’,

file_location varchar(200) NOT NULL default ’’,

track_size decimal(3,2) NOT NULL default ’0.00’,

discontinued enum(’Y’,’N’) NOT NULL default ’N’,

PRIMARY KEY (track_id)

) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;

-- --------------------------------------------------------

--

-- Table structure for table ‘transaction‘

--

CREATE TABLE ‘transaction‘ (

transaction_id int(10) NOT NULL auto_increment,

transaction_date date NOT NULL default ’0000-00-00’,

total_price decimal(4,2) NOT NULL default ’0.00’,

PRIMARY KEY (transaction_id)

) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;

-- --------------------------------------------------------

--

-- Table structure for table ‘transaction_line‘

--

CREATE TABLE transaction_line (

line_id int(10) NOT NULL auto_increment,

transaction_id int(10) NOT NULL default ’0’,

track_id int(10) NOT NULL default ’0’,

album_id int(10) NOT NULL default ’0’,

PRIMARY KEY (line_id)

) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=1 ;

A.2.2 Source Schema

-- phpMyAdmin SQL Dump

-- version 2.6.2

-- http://www.phpmyadmin.net

--

-- Host: localhost

-- Generation Time: May 13, 2006 at 03:33 PM

-- Server version: 4.1.11

-- PHP Version: 5.0.4

--

-- Database: ‘atom_kiosk‘

--

-- --------------------------------------------------------

--

-- Table structure for table ‘album‘

--


CREATE TABLE album (

album_id int(10) NOT NULL auto_increment,

artist_id int(10) NOT NULL default ’0’,

genre_id int(10) NOT NULL default ’0’,

release_date date NOT NULL default ’0000-00-00’,

cover_path varchar(200) NOT NULL default ’’,

album_title varchar(50) NOT NULL default ’’,

PRIMARY KEY (album_id),

KEY release_date (release_date),

KEY cover_path (cover_path),

KEY album_title (album_title)

) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=1;

-- --------------------------------------------------------

--

-- Table structure for table ‘artist‘

--

CREATE TABLE artist (

artist_id int(10) NOT NULL auto_increment,

artist_name varchar(50) NOT NULL default ’’,

PRIMARY KEY (artist_id),

KEY artist_name (artist_name)

) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=1;

-- --------------------------------------------------------

--

-- Table structure for table ‘genre‘

--

CREATE TABLE genre (

genre_id int(10) NOT NULL auto_increment,

genre_name varchar(50) NOT NULL default ’’,

PRIMARY KEY (genre_id),

KEY genre_name (genre_name)

) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=1;

-- --------------------------------------------------------

--

-- Table structure for table ‘track‘

--

CREATE TABLE track (

track_id int(10) NOT NULL auto_increment,

album_id int(10) NOT NULL default ’0’,

track_title varchar(50) NOT NULL default ’’,

track_length time NOT NULL default ’00:00:00’,

file_location varchar(200) NOT NULL default ’’,

track_size decimal(3,2) NOT NULL default ’0.00’,

PRIMARY KEY (track_id),

KEY track_title (track_title),

KEY file_location (file_location)

) ENGINE=InnoDB DEFAULT CHARSET=latin1 AUTO_INCREMENT=1;


Appendix B

Experiment Data Sets

Rather than provide the raw SQL data for these tests, PHP scripts have been provided

which were used to generate the SQL data.

B.1 Load Test

<?php
// The following script contains a set of simple for loops, one loop for each
// table; altering the number of iterations each loop performs will alter the
// number of SQL statements generated...

// initialise SQL string...
$sql = "";

// genres...
for ($i = 1; $i <= 5; $i++)
{
    $sql .= "INSERT INTO genre VALUES(".$i.", 'genre".$i."'); \n";
}

// artists...
for ($i = 1; $i <= 100; $i++)
{
    $sql .= "INSERT INTO artist VALUES(".$i.", 'artist".$i."'); \n";
}

// albums...
$album_id = 1;
$g = 1;
for ($ar = 1; $ar <= 100; $ar++)
{
    for ($al = 1; $al <= 5; $al++)
    {
        $sql .= "INSERT INTO album VALUES(".$album_id.", ".$ar.", ".$g.", '".rand(1900, 2005)."-".rand(1,12)."-".rand(1,29)
            ."', 'coverpath_for_album".$album_id."', 'album_title".$album_id."'); \n";
        $album_id++;
    }
    if ($g == 5)
    {
        $g = 1;
    }
    else
    {
        $g++;
    }
}

// tracks...
$track_id = 1;
for ($al = 1; $al <= 500; $al++)
{
    for ($t = 1; $t <= 10; $t++)
    {
        $sql .= "INSERT INTO track VALUES(".$track_id.", ".$al.", 'track_title".$track_id."', '00:".rand(2,10).":".rand(10,45)."', '"
            ."file_location".$track_id."', ".rand(1,5).".".rand(0,99)."); \n";
        $track_id++;
    }
}

// location of output SQL file...
$fh = fopen("c:\source5605.sql", "w");
fwrite($fh, $sql);
fclose($fh);

echo("SQL script complete.");
?>

B.2 Operational Test

<?php
// This PHP script generates SQL update commands
// in order to produce an SQL update script for the
// operational test.

$source_name = "3019";
$sql = "";

// tracks...
for ($i = 1; $i <= 150; $i++)
{
    $sql .= "UPDATE track SET track_title = 'track_titleUPDATE_".$i."_".$source_name."' WHERE track_id = ".$i."; \n";
}

// albums...
for ($i = 1; $i <= 100; $i++)
{
    $sql .= "UPDATE album SET album_title = 'album_titleUPDATE_".$i."_".$source_name."' WHERE album_id = ".$i."; \n";
}

// genres...
for ($i = 1; $i <= 10; $i++)
{
    $sql .= "UPDATE genre SET genre_name = 'genreUPDATE_".$i."_".$source_name."' WHERE genre_id = ".$i."; \n";
}

// artists...
for ($i = 1; $i <= 100; $i++)
{
    $sql .= "UPDATE artist SET artist_name = 'artistUPDATE_".$i."_".$source_name."' WHERE artist_id = ".$i."; \n";
}

$fh = fopen("../DB_SCRIPTS"."/".$source_name."_update.sql", "w");
fwrite($fh, $sql);
fclose($fh);

echo("SQL script complete.");
?>

B.3 Latency Test Data

The latency test data was generated using the same PHP script as that used for the operational test data; refer to Section B.2.
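For the latency test, one update script was produced per data source; the only change between runs was the value of the $source_name variable at the top of the Section B.2 script. The sketch below shows one way this per-source variation could be parameterised; the wrapper function, the abbreviated loop body and the additional source identifiers are assumptions for illustration rather than part of the original script.

<?php
//Hypothetical per-source wrapper around the Section B.2 generator.
function generate_update_script($source_name)
{
    $sql = "";
    //tracks (the album, genre and artist loops follow the same
    //pattern, exactly as listed in Section B.2)...
    for($i=1;$i<=150;$i++)
    {
        $sql .= "UPDATE track SET track_title = 'track_titleUPDATE_".$i."_".$source_name."' WHERE track_id = ".$i."; \n";
    }
    $fh = fopen("../DB_SCRIPTS/".$source_name."_update.sql", "w");
    fwrite($fh, $sql);
    fclose($fh);
}

//one script per source; the identifiers beyond 3019 are examples only
foreach(array("3019", "3020", "3021") as $name)
{
    generate_update_script($name);
}
echo("SQL scripts complete.");
?>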


Appendix C

Example RSS Feed

<?xml version="1.0" encoding="utf-8"?>

<rdf:RDF xmlns="http://purl.org/rss/1.0/"

xmlns:dc="http://purl.org/dc/elements/1.1/"

xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">

<channel rdf:about="http://www.w3.org/2000/08/w3c-synd/home.rss">

<title>The World Wide Web Consortium</title>

<description>Leading the Web to its Full Potential...</description>

<link>http://www.w3.org/</link>

<dc:date>2002-10-28T08:07:21Z</dc:date>

<items>

<rdf:Seq>

<rdf:li rdf:resource="http://www.w3.org/News/2002#item164"/>

<rdf:li rdf:resource="http://www.w3.org/News/2002#item168"/>

<rdf:li rdf:resource="http://www.w3.org/News/2002#item167"/>

</rdf:Seq>

</items>

</channel>

<item rdf:about="http://www.w3.org/News/2002#item164">

<title>User Agent Accessibility Guidelines Become a W3C

Proposed Recommendation</title>

<description>17 October 2002: W3C is pleased to announce the

advancement of User Agent Accessibility Guidelines 1.0 to

Proposed Recommendation. Comments are welcome through 14 November.

Written for developers of user agents, the guidelines lower

barriers to Web accessibility for people with disabilities

(visual, hearing, physical, cognitive, and neurological).

The companion Techniques Working Draft is updated. Read about

the Web Accessibility Initiative. (News archive)</description>

<link>http://www.w3.org/News/2002#item164</link>

<dc:date>2002-10-17</dc:date>

</item>

<item rdf:about="http://www.w3.org/News/2002#item168">

<title>Working Draft of Authoring Challenges for Device

Independence Published</title>

<description>25 October 2002: The Device Independence

Working Group has released the first public Working Draft of

Authoring Challenges for Device Independence. The draft describes

the considerations that Web authors face in supporting access to

their sites from a variety of different devices. It is written

for authors, language developers, device experts and developers

of Web applications and authoring systems. Read about the Device

Independence Activity (News archive)</description>

<link>http://www.w3.org/News/2002#item168</link>

<dc:date>2002-10-25</dc:date>


</item>

<item rdf:about="http://www.w3.org/News/2002#item167">

<title>CSS3 Last Call Working Drafts Published</title>

<description>24 October 2002: The CSS Working Group has

released two Last Call Working Drafts and welcomes comments

on them through 27 November. CSS3 module: text is a set of

text formatting properties and addresses international contexts.

CSS3 module: Ruby is properties for ruby, a short run of text

alongside base text typically used in East Asia. CSS3 module:

The box model for the layout of textual documents in visual

media is also updated. Cascading Style Sheets (CSS) is a

language used to render structured documents like HTML and

XML on screen, on paper, and in speech. Visit the CSS home

page. (News archive)</description>

<link>http://www.w3.org/News/2002#item167</link>

<dc:date>2002-10-24</dc:date>

</item>

</rdf:RDF>
