
What is the future of the RDBMS in the Enterprise?

School of Computer Science and Statistics

TRINITY COLLEGE

What is the future of the RDBMS in the Enterprise?

Stuart Clancy

Edward Fitzpatrick

Degree Year

BSc (Hons) Information Systems

11th April 2011

A Dissertation submitted to the University of Dublin in partial fulfilment of the

requirements for the degree of BSc (Hons) Information Systems

Date of Submission: 11th April 2011


Declaration

We declare that the work described in this dissertation is, except where otherwise stated,

entirely our own work, and has not been submitted as an exercise for a degree at this or any

other university.

Signed:___________________

Stuart Clancy

Date of Submission:

Signed:___________________

Edward Fitzpatrick

Date of Submission:


Permission to lend and/or copy

We agree that the School of Computer Science and Statistics, Trinity College may lend or

copy this dissertation upon request.

Signed:___________________

Stuart Clancy

Date of Submission:

Signed:___________________

Edward Fitzpatrick

Date of Submission:


Acknowledgements

We would like to acknowledge and thank Ronan Donagher, our project supervisor, and Diana Wilson, the acting course director, for their support, guidance and understanding throughout our research project.

We would also like to acknowledge the unfailing support of our families, who have encouraged us throughout the years of our study; our employers and work colleagues, who have been patient and flexible with working arrangements in order to allow us to complete our studies; and close friends, who on occasion were called upon to provide a welcome distraction and perspective.

Signed:___________________

Stuart Clancy

11th April 2011

Signed:___________________

Edward Fitzpatrick

11th April 2011


Abstract

Managing data and information has been a feature of human activity since the first acknowledged symbols were etched onto stone by Neolithic humans. Since the emergence of the Internet, the volume of data available to man and machine has been growing rapidly. This dissertation looks at what this means for the traditional relational database management system (RDBMS). It asks whether there is a future for the RDBMS in enterprise information system architecture. It also examines the early developmental years of the RDBMS in order to gain an insight into why it has enjoyed relative longevity within a rapidly changing technology environment. New types of database and data management systems are discussed, such as NoSQL and other open source non-relational DBMS such as Hadoop and Cassandra. The data volume and data type problem is absorbed into various sections under the umbrella term 'Big Data'. Utility companies and social networking sites, two sectors where the management of large data volumes is a growing concern, are examined in the two case studies. A separate chapter on the research methodology chosen by us is included. It provides the necessary balance between subject matter and method as set out in the initial requirements.

Keywords:

Relational Theory, DBMS, RDBMS History, NoSQL, Hadoop, Cassandra, Database Market,

Big Data, Research Methodology.


Table of Contents

Abstract
List of Figures
List of Tables
List of Abbreviations

Chapter One - Introduction
1.1 The Research Question
1.2 Document Roadmap

Chapter Two - Literature review, findings and analysis
2.1 Introduction
2.2 RDBMS
2.2.1 History of the RDBMS
2.2.2 Main Features of 'true' RDBMS
2.2.3 IBM, Ellison and the University of California, Berkeley
2.3 New Databases
2.3.1 Features of NoSQL Databases
2.3.2 Hadoop
2.3.2.1 Components of Hadoop
2.3.3 Cassandra
2.4 The market for RDBMS' and Non-Relational DBMS'
2.4.1 Introduction
2.4.2 RDBMS Market
2.4.2.1 Vendor Offerings
2.4.4 Open Source Databases
2.4.4.1 Non-RDBMS Market
2.5 Case Studies
2.5.1 Case Study 1 - Utility Companies and the Data Management challenge
2.5.1.1 Introduction
2.5.1.2 Utilities
2.5.1.3 Smart Grid - The ESB case
2.5.1.4 The Data Volume Problem
2.5.1.5 How one utility company is meeting the data volume challenge
2.5.1.6 What is the ESB doing?
2.5.1.7 Conclusion
2.5.2 Case Study 2 - Social Networks - The migration to Non-SQL database models
2.5.2.1 Facebook Messages
2.5.2.2 Twitter - The use of NoSQL databases at Twitter

Chapter Three - Research Methodology
3.1 Introduction
3.2 The strategy adopted for researching the question
3.3 A Theoretical Framework
3.4 Research Design
3.5 Methodology - A Qualitative Approach
3.6 Methods
3.6.1 Method - Analytic Induction
3.6.2 Method - Content Analysis
3.6.3 Method - Historical Research
3.6.4 Method - Case Study
3.6.5 Method - Grounded Theory
3.7 Ethics Approval
3.8 Audience
3.9 Significance of research
3.10 Limitations of the research methodology
3.11 Conclusion

Chapter Four - Conclusions, Limitations of Research and Future Work
4.1 Introduction
4.2 Conclusions
4.2.1 RDBMS
4.2.2 New DB's
4.2.3 Market
4.2.4.1 Case Study 1 - Utility Companies
4.2.4.2 Case Study 2 - Social Networks
4.3 Future Research
4.3.1 NoSQL
4.3.2 Case Studies
4.3.3 Business Intelligence
4.3.4 Research Methodology
4.4 Limitations of the Research
4.5 Final thoughts

REFERENCES
APPENDIX 1


List of Figures

Figure 2.1 - A simplified DBMS
Figure 2.2 - Overview of a generic Smart Grid
Figure 2.3 - ESB proposed implementation of Advanced Metering
Figure 2.4 - Smart Meters transaction rate
Figure 2.5 - Smart Meters data size
Figure 2.6 - Sources of Smart Grid data with time dependencies

List of Tables

Table 2.1 - Impact of unstructured data on productivity
Table 2.2 - Example of redundant rows in a database
Table 3.1 - Key concepts in Qualitative and Quantitative research methodologies
Table A.1 - Edgar Codd's original relational model terms


List of Abbreviations

ACID - Atomicity, Consistency, Isolation and Durability.
ACM - Association for Computing Machinery.
BA - Business Analytics.
BASE - Basically Available, Soft state, Eventual consistency.
BI - Business Intelligence.
BSD - Berkeley Software Distribution.
CA - Computer Associates.
CAP - Consistency, Availability and Partition tolerance.
CIS - Customer Information System.
CODASYL - Conference on Data Systems Languages.
CRM - Customer Relationship Management.
DBMS - Database Management System.
DMS - Distribution Management System.
DW - Data Warehousing.
ERM - Enterprise Relationship Management.
GB - Gigabyte.
GBT - Google Big Table.
GFS - Google File System.
GIS - Geographical Information System.
HA - High Availability.
HDFS - Hadoop Distributed File System.
IA - IBM's Information Architecture.
IBM - International Business Machines.
ISV - Independent Software Vendor.
IT - Information Technology.
KB - Kilobyte.
MB - Megabyte.
MDM - Meter Data Management.
MPL - Mozilla Public Licence.
MR - MapReduce.
NoSQL - 'No' SQL or, more often, 'Not Only' SQL.
OEM - Original Equipment Manufacturer.
OLAP - Online Analytical Processing.
OLTP - Online Transaction Processing.
OMS - Outage Management System.
OS - Operating System.
OSI - Open Source Initiative.
PB - Petabyte.
PDC - Phasor Data Concentrators.
PLM - Product Life-cycle Management.
RDBMS - Relational Database Management System.
SCADA - Supervisory Control and Data Acquisition.
SOA - Service Oriented Architecture.
SQL - Structured Query Language.
TB - Terabyte.


Chapter One - Introduction

Humans have been storing information outside of the brain since probably before the first consistent markings were made on bone, such as those found in Bulgaria dating from more than a million years ago. They have certainly done so since the later Neolithic clay calculi bearing symbols representing quantities, through the cave paintings at Lascaux over 17,000 years ago, the invention of the moveable type printing press and, eventually, the first computers. Since the emergence of the information age over the last fifty years or so, the amount of data transferred and stored in computers has grown rapidly. Research from the International Data Corporation (IDC) in 2008 puts that growth at 60% per annum (The Economist, 2010).

An added complexity is that executive strategies now count business intelligence for competitive edge among their key goals. Data management systems, which for many years have been the old reliable workhorse toiling away somewhere in the back end, are once again playing a key role in driving business growth. The question is, are they still capable of carrying out this new and challenging task? This dissertation asks that question and, more specifically, what is the future for the Relational Database Management System (RDBMS) in the Enterprise?

The data volume problem now has a name: 'Big Data'. Its nascence coincides with the growth of the Internet. Alternative solutions to the traditional RDBMS for dealing with 'Big Data' soon followed. Many of these solutions are based either on massively parallel processing (MPP, a form of distributed computing) or on flipping the row store of the RDBMS into a column store system. More recently, MPP solutions are being positioned not as alternatives but as complements to the RDBMS (Stonebraker et al., 2010). Add to this mix a dynamic data management market in which vendors are acquiring new technology, merging with each other, adopting open source and creating hybrid stacks in an effort to gain advantage in a market forecast to grow to $32 billion by 2013 (Yuhanna, 2009).

1.1 The Research Question

Time was taken to carefully frame our research question so as to provide a clear path of exploration of the subject. The subject could have been framed as a hypothesis to be tested, such as "The future for the RDBMS in the Enterprise is looking bright", or the contrary statement, "The end is nigh for the RDBMS". We chose instead to frame our research as an open-ended question to


allow for a broad exploration of the subject with no preconception of the outcome. The broadness of scope, however, is necessarily tempered by restricting our research to those organisations defined as enterprises. There is a difficulty here, as there is no overarching definition of an enterprise organisation. It is nevertheless necessary to draw some clearly defined boundaries around the term. For this dissertation an enterprise is defined not by size or function alone.

Enterprises, for us, are organisations where the scale of control is large. They include companies with a large number of customers and employees, as well as companies that control a large infrastructure or several functional units. Enterprises have one top-level strategy to which all other functional units are aligned. This last point is an important characteristic of an enterprise for our dissertation, as it applies to decision making for acquiring information management systems.

The presence of the word 'future' is central to locating the research in an exploratory and intuitive research domain. It prompts looking into the past in an attempt to explain the present and predict the future. It forces an open mind and a questioning approach. It enables the creation of new ideas, which are either taken up or set aside for another time. The chapters and sections are set out below in an attempt to follow this map, in the view that the journey is the objective rather than the destination.

1.2 Document Roadmap

In writing this dissertation a balance was sought between addressing the issues raised by the initial question and the research methodology chosen. The bulk of this dissertation therefore centres on those two areas. In this chapter we introduce our research and explain why we find it interesting. The research question is explained and the objective is put in context. Chapter two contains the literature review. The chapter begins with an outline of the RDBMS, its features and its history of development. Particular attention is given to the role of IBM in the development of the RDBMS. The chapter moves on to discuss new databases and data management systems. A section on the DBMS market follows and presents an overview of the current vendor offerings. The market section does not attempt a comparison of available systems, as this work has been carried out in greater detail by others more expert than us. Throughout the dissertation we refer the reader to such work where it is not feasible for us to reproduce it.


Two case studies are included to put the research question in a practical context. The two areas chosen involve contrasting enterprises: on one hand the relatively long-established utilities sector, and on the other the new phenomenon of social networking and its associated companies. Even though they operate in widely different markets generating different types of data, both share similar problems when it comes to managing large amounts of data. Likewise, both are trying to get to grips with extracting value out of data for competitive edge.

Chapter three discusses the research methodology chosen by us. It deserves a chapter to itself in view of the objective of this dissertation. The chapter begins with an introduction to research theory. It then moves to a discussion of our research strategy. A research framework is introduced as a model of our strategy. The different methodologies available are outlined and our chosen option is explained. Next, a group of related research methods is outlined and the reasons for their selection are stated. Short sections on ethics approval, audience and the significance of the research follow, before a final section on the limitations of our chosen research methodology closes the chapter.

The final chapter attempts to pull together the conclusions and findings from all the previous sections. Relevant research threads and ideas not covered in sufficient detail in the dissertation are mentioned. The last sections present a summary of the limitations of the overall research and our final concluding thoughts.


Chapter Two - Literature review, findings and analysis

2.1 Introduction

In this section the focus is on the RDBMS. The intention is to provide an overview of its defining features. It is not an in-depth technical analysis of the RDBMS, and we would refer the reader to better papers on the subject, such as those published in the Communications of the Association for Computing Machinery (ACM), to which we refer several times. The section also sets out the background to the development of the RDBMS. Within that context an interesting discovery is made with respect to IBM's initial role in the development of database management systems. For the purpose of exploring the question of the future of the RDBMS, some associated concepts are discussed, such as data types, 'true' RDBMS, and whether or not the past can teach us something about the future.

2.2 RDBMS

Databases

It is unfortunate that in the realm of Information Technology (IT) acronyms are not always self-explanatory. Many such acronyms do not travel outside of their specific domain very well. Take for example DQDB, or Distributed Queue Dual Bus; outside the world of high-speed networks this might seem to be a very efficient urban transport vehicle. Luckily the term RDBMS contains within itself the individual components which define it: a system (S) composed of a database (DB) in which information is stored by creating relationships (R) between data elements, and which can be managed (M) by users. It is helpful at this point to explain the hierarchy, at least, of each of these components.

Throughout this dissertation data (singular: datum) and information are taken to be classifications of entities stored in a system. Data is lowest in the sense of the taxonomy data - information - knowledge - wisdom (the last sometimes called understanding), but not lower in real value; a single-digit integer may be enough data to invoke the required wisdom to make an important decision. For the purpose of simplicity, data here means a binary entry (such as yes or no, 1 or 0) or a nominal entry (such as dog, 470, Smith, XRA9000, etc.). An analogy from biology might see data as the molecules which make up a cell of information. The word 'molecules' is carefully suggested instead of 'atoms', given that 'atomicity' has particular significance for relational databases. Extending the analogy, a body of knowledge would be built from the cells of information; it would be unwise to stretch the analogy further to address wisdom. Unhelpfully, the words 'data' and 'information' are often interchangeable in research literature. Some examples of this are the concepts 'Big Data' and 'unstructured data', for what really ought to be called information. For this reason, and for the purpose of consistency, this dissertation will hold with the literature and consider the two terms as one, except where a distinction is required.

A database has been defined in a number of sources as a “collection of related data or

information” (Bocij et al. 2006, p. 153; Elmasri and Navathe, 1989, p. 3).

The Oxford English Dictionary defines a database as a "structured set of data held in a computer" (OED). However, the Cambridge Advanced Learner's online dictionary (2011) definition is perhaps closer to a contemporary one:

“A large amount of information stored in a computer system in such a way that it can

be easily looked at or changed”.

It is noted that the definition in the later online edition of the Cambridge dictionary (2011) makes no explicit reference to relational, structured or organised data. This looser definition reflects the changing nature of data management as newer types and bigger volumes of data are captured.

Finally, a definition from the business world expands on the above, mentioning different types of data and hinting at the issues regarding scale. A database is "a systematically organized or structured repository of indexed information (usually as a group of linked data files) that allows easy retrieval, updating, analysis, and output of data. Stored usually in a computer, this data could be in the form of graphics, reports, scripts, tables, text, etc., representing almost every kind of information." (Business Dictionary, 2011).

Structured and unstructured data.

The last definition above alludes to unstructured data. Unstructured data is data in the form of text (words, messages, symbols, emails, SMS texts, reports) or bitmaps (images, graphics). A good example of the growing relevance of unstructured information is a Facebook page containing images, short messages, links, and chunks of text that can be altered at any time.

Structured data, by contrast, is any data "that has an enforced composition to the atomic data types" (Weglarz, 2004). Atomicity is the characteristic of a stored entity that is not divisible (Elmasri and Navathe, 1989, p. 41). Atomicity is a key requirement for defining structured data and is what relational databases rely on to make relationships. A database designer can decide on the exact rules for the structured data and the level of atomicity required. As an aside, it is often this small amount of flexibility in the design of the data model which is responsible for the creation of many 'bad' databases. Structured data is data that is consistent, unambiguous and conforms to a predefined standard. Structured data will be examined in more detail later in the section discussing the RDBMS. A third type is semi-structured data. This is data held in a standard format such as forms, spreadsheets and XML files. This type of data can be parsed by computer programs more easily than unstructured data because the data is generally located in a fixed and known place, even if the data itself is not atomic.
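The point about fixed, known locations can be sketched in a few lines of Python (the sample record and element names here are our own, purely illustrative):

```python
# Semi-structured data: the value is not atomic in a relational sense,
# but a parser can still find it by its known location in the document.
import xml.etree.ElementTree as ET

semi_structured = "<student><name>Smith</name><grade>82</grade></student>"

record = ET.fromstring(semi_structured)
grade = int(record.find("grade").text)  # the <grade> element is a fixed, named place
print(grade)  # 82
```

Doing the same with an unstructured sentence such as "Smith scored in the low eighties" would require text analysis rather than a simple lookup.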

The problem of structured versus unstructured data types can be illustrated using the example of two schools. One school grades students in the traditional way, giving a numerical grade following an examination. The other school does not give numerical grades, preferring a method whereby students are furnished with a qualitative report on their overall performance. The former produces structured data: the meaning of a grade of 82% is consistent in the context of the school's grading system. It can be easily recorded, measured, and compared with other grades internally or from other schools using the same system. The report format, however, is unstructured, and comparison with a numerical grading system is not so easy. Gleaning relevant information from a text report is complex and involves semantic analysis, with or without the help of technology.
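A minimal Python sketch (the names, grades and report text are invented for illustration) makes the contrast concrete: comparing structured grades is a single operation, while even a crude reading of the unstructured report requires some form of text analysis:

```python
# Structured data: numerical grades with a consistent, comparable meaning.
structured_grades = {"Smith": 82, "Jones": 74}

# Unstructured data: a qualitative report on the same student.
unstructured_report = (
    "Smith engaged well this term and showed a strong grasp "
    "of the material, though written work was occasionally rushed."
)

# Structured: finding the top student is one unambiguous operation.
best = max(structured_grades, key=structured_grades.get)

# Unstructured: even judging whether the report is positive needs semantic
# analysis; naive keyword counting is a poor but illustrative stand-in.
positive_words = {"well", "strong", "excellent"}
score = sum(word.strip(",.") in positive_words
            for word in unstructured_report.lower().split())

print(best, score)  # Smith 2
```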


What does this mean for enterprises?

Eighty percent of information relevant to business is unstructured and is mostly in textual form (Langseth in Grimes, 2011). Seth Grimes, an analytics expert with the Alta Plana Corporation, has investigated this claim. He concludes that even if the origins of the 80% figure are elusive (Grimes tracks it back as far as the 1990s), experience supports the claim (Grimes, 2011). Patricia Selinger (IBM and ACM Fellow), who has worked on query optimisation for 27 years, puts unstructured data in companies at about 85% (Selinger, 2005). Even assuming a figure lower than 80% for unstructured data in larger enterprises, where much information is held in structured form in traditional transaction-based databases, there remains the problem of how to leverage competitive advantage out of the nuggets of information buried in the rich seams of unstructured data. Businesses are realising that the chances of extracting valuable wisdom from traditional data stores using stale analysis methods and tools are diminishing, and that new ideas are needed.

Unstructured data is growing faster than structured data. According to the "IDC Enterprise Disk Storage Consumption Model" 2008 report, "while transactional data is projected to grow at a compound annual growth rate (CAGR) of 21.8%, it's far outpaced by a 61.7% CAGR prediction for unstructured data" (Pariseau, 2008).
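To see what these rates imply, a small worked calculation (our own arithmetic, from an arbitrary baseline of 100 units) compounds each CAGR over five years:

```python
# Our own illustrative arithmetic: compounding the quoted CAGR figures
# over five years from an arbitrary baseline of 100 units of data.
transactional = 100 * (1 + 0.218) ** 5  # 21.8% CAGR (transactional data)
unstructured = 100 * (1 + 0.617) ** 5   # 61.7% CAGR (unstructured data)

print(round(transactional), round(unstructured))  # 268 1105
```

On these figures, unstructured data would grow to roughly four times the relative volume of transactional data within five years.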

Kevin McIssac (2007) of Computer World magazine puts it into perspective:

“Unfortunately business is drowning in unstructured data and does not yet have the

applications to transform that data into information and knowledge. As a result staff

productivity around unstructured data is still relatively low.”

McIssac gives examples of the impact of unstructured data on productivity citing research

from various sources. Table 2.1 below summarises those impacts:


9.5 hours per week - Average time an office worker spends searching, gathering and analysing information (60% of that on the Internet). Source: Outsell.

10% of working time - Time professionals in the creative industry spend on file management. Source: GISTICS.

600 e-mails per week - Sent and received by a typical business person. Source: Ferris Research.

49 minutes per day - Time an office worker spends managing e-mail; longer for middle and upper management. Source: ePolicy Institute.

Table 2.1 - Impact of unstructured data on productivity.

Where are the joins?

It seems that a reappraisal of what a database is, or needs to do, is well under way. If so, this reappraisal logically extends to the database management system. Structured data can be joined to other structured data to form concatenations of information using a query language based on mathematical operations. Things get a little more ‘fuzzy’ with unstructured data. Stock market analysts might like to try querying online media sources for all posts where the word ‘oil’ is used, but only in the context of the recent crisis in Libya. How unstructured and unrelated data is to be stored in a system, and how meaningful information can be retrieved back out of that same system, are questions many organisations are now asking. Similar questions were asked before, however, and the past may hold some lessons for us.
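To make the contrast concrete, the following is a minimal sketch of the kind of join at which relational systems excel. The schema and data are invented for illustration and are not drawn from any system discussed in this dissertation; Python's built-in sqlite3 module is used only as a convenient relational engine.

```python
import sqlite3

# In-memory database with two related tables (hypothetical schema).
con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE student (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE grade (student_id INTEGER REFERENCES student(id),
                        module TEXT, mark INTEGER);
    INSERT INTO student VALUES (1, 'Ann'), (2, 'Bob');
    INSERT INTO grade VALUES (1, 'Databases', 70), (2, 'Databases', 55);
""")

# A join concatenates structured rows on a shared key - well-defined,
# set-based semantics. No comparable operator exists for asking a pile
# of news articles "where does 'oil' refer to the Libyan crisis?".
rows = con.execute("""
    SELECT s.name, g.module, g.mark
    FROM student s JOIN grade g ON g.student_id = s.id
""").fetchall()
print(rows)
```

The join is well-defined precisely because both operands share a rigid structure and a common key domain - the very properties unstructured data lacks.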


A DBMS

In its simplest definition a DBMS is a set of computer programs that allows users to create and maintain a database (Elmasri & Navathe, 1989 p. 4). Bocij et al. (2006, p. 154) expand on this definition a little: “One or more computer programs that allow users to enter, store, organise, manipulate and retrieve data from a database.”

Figure 2.1 - A simplified DBMS (Source: Elmasri and Navathe, 1989 p. 5)

Figure 2.1 above shows the key components of a data management system. A detailed description of each of the components of the system is not necessary for our purpose, but briefly they are:

• Application programs through which users can interact with the stored data.

• Software programs for processing and accessing the stored data.

• A high-level declarative language interface for executing commands (commonly

known as a query language).


• A repository for storing data.

• A store of information related to the data for classifying or indexing purposes (meta-data).

• Hardware suitable for each of the above functions.

• Users (including database administrators and designers).

2.2.1 History of the RDBMS

To understand why newer types of databases and data management systems are emerging and

taking hold it seems reasonable to explore why RDBMS’ came into existence, as well as their

usefulness and relative longevity.

The 1960’s BC (Before Codd)

Data management systems existed before Edgar Codd, while at IBM, wrote his seminal paper, published in 1970, called “A Relational Model of Data for Large Shared Data Banks”. Codd's paper presented a new database model and hence introduced the world of database management to relational theory (Codd, 1970). In his paper Codd discusses the limitations of the existing hierarchical and network data systems and introduces a query language based on relational algebra and predicate calculus.

In a later important paper he described 12 rules for a relational database management system

(Codd, 1985). Systems that satisfy all 12 rules are rare. In fact, it is argued that no truly

relational database systems existed in wide commercial production even a decade after

Codd’s vision (Don Heitzmann in Thiel, 1982), and even up to more recently (Anthes, 2010).

A brief description of the two data management systems whose limitations Codd addressed is a useful precursor to a broader description of relational DBMS’.

Hierarchical Data Models

Hierarchical data models are similar to tree-structured file systems in that the data is stored as parent-child relationships. Codd asserted that hierarchical and network based DBMS’ were not data models in comparison to his more formalised relational model (Codd, 1991). For simplicity the word ‘model’ is maintained for the data structure of all systems under


discussion here. The model made sense to organisations that were naturally hierarchical in nature - a legacy of Henri Fayol and his 14 management principles, popular in the 1960’s and still used in organisations today (Stoner and Freeman, 1989; Tiernan et al., 2006). A hierarchical data model can be presented as a tree structure of parent-child relationships or as an adjacency list. For example: a root entity with no parent might be SCHOOL; STUDENT is a child of SCHOOL; GRADE is a child of STUDENT. STUDENT is also a child of COURSE. In this type of structure data can be replicated many times in different branches of the tree, a relationship of ‘one to many’ or 1:N. A ‘modified preorder tree traversal’ algorithm is used to number each entity on the way down through the tree structure (left value) and again on the way back up to the root (right value), making query operations more efficient in navigating around the data (Van Tulder, 2003).
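The numbering scheme can be sketched in a few lines. The code below is purely illustrative (the function and variable names are our own); it assigns each node a left value on the way down and a right value on the way back up, so that “all descendants of X” becomes a simple range test rather than a recursive walk.

```python
# A sketch of 'modified preorder tree traversal' (nested-set) numbering.
# Each node receives a left value when first visited and a right value
# when the traversal returns; descendants of a node are exactly those
# whose values fall strictly inside its (left, right) interval.

def number_tree(tree, counter=None):
    """tree = (name, [children]); returns {name: (left, right)}."""
    if counter is None:
        counter = [1]          # mutable counter shared across recursion
    name, children = tree
    result = {}
    left = counter[0]          # assigned on the way down
    counter[0] += 1
    for child in children:
        result.update(number_tree(child, counter))
    right = counter[0]         # assigned on the way back up
    counter[0] += 1
    result[name] = (left, right)
    return result

# The SCHOOL/STUDENT/GRADE names echo the example above.
school = ("SCHOOL", [("STUDENT", [("GRADE", [])])])
values = number_tree(school)
print(values)
```

With these values stored alongside each record, a query such as “everything under SCHOOL” reduces to selecting rows whose left value lies between SCHOOL’s left and right values.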

The first hierarchical DBMS was developed by IBM and North American Aviation in the late 1960’s (Elmasri and Navathe, 1989 p. 278). IBM imaginatively called it Information Management System (IMS), and Frank Hayes dates its roll-out to 1968 (Hayes, 2002). Elmasri and Navathe (1989, p. 278) cite McGee (1977) for a good overview of IMS.

Network Data Models

As can be seen in the hierarchical data model above, a child may logically belong to many parents. A STUDENT, for instance, can take more than one MODULE in any COURSE YEAR, yet in a hierarchical structure the same STUDENT would have to appear under each of the MODULE trees. In other words, many students can take many modules. The network data model was a further development of the hierarchical model to address the issue of managing ‘many to many’ (M:N) relationships. The Conference on Data Systems Languages (CODASYL) defined the network model in 1971 (Elmasri and Navathe, 1989).

Where the underlying principle of the hierarchical model was parent-child tree structures, in a network model it is set theory. Records are classified into record types and given names.

These records are sets of related data. Record types are akin to tables in a relational database

model. The intricacies of set theory are beyond the scope of this dissertation; however, it

suffices to say that complex data combinations can be achieved by nesting record types

within other record types – data sets as members of other data sets. If this were possible in a

relational database it would be like having tables within tables within tables.


The earliest work on a network data model was carried out by Charles Bachman in 1961 while working for General Electric. His work resulted in the first commercial DBMS, called Integrated Data Store (IDS). The system was cumbersome and was eventually redeveloped by an IDS customer, the BF Goodrich Chemical Company, into what was called IDMS, which ran on IBM mainframes (Hayes, 2002). With Bachman on board as a consultant, IDMS was eventually commercialised by Cullinane/Cullinet Software in the 1980’s. Cullinet was bought by Computer Associates (CA) in 1989, and IDMS remains a current CA offering for mainframe database management today. Charles Bachman received the Turing Award in 1973 for his pioneering work in developing the first commercially available data management system, for being one of the founders of CODASYL, and for his work on representation methods for data structures (Canning in Bachman, 1973).

The 1970’s

Adabas DBMS was developed in the 1970 by Software AG. It has an interesting feature of

relevance to this dissertation. Adabas was designed to run on mainframes for enterprises with

large data sets and requiring fast response times for multiple users. One of its main features is

that it indexes data using inverted-list type indexing.

Adabas also features a data storage address convertor, which avoids data fragmentation. Data fragmentation can occur when a record is updated with additional data: the record becomes too large to be stored in its original location. The data can be moved to a new location, but the indexes still expect the data to be in the same place, so they too have to be updated. The address convertor does this. The alternative, as used by other systems, is data fragmentation: part of the data is stored in the original location with a pointer to where the remainder is stored. Fragmentation and pointer methods, however, require additional processing and hence give slower response times. The problem of using pointers in systems predating the RDBMS, instead of storing data directly (in tuples, as is done in an RDBMS), is referred to by IBM’s Irv Traiger (in McJones, 1997 pp. 16-17).

According to Curt Monash, Adabas’ inverted-list indexing is the favoured method for searching textual content. New ideas regarding the management of text (unstructured data) have, according to Monash, “at least the potential of being retrofitted to ADABAS, should the payoff be sufficiently high” (Monash, Dec 8 2007).


Edgar Codd and the birth of the Relational Model

Codd’s text ‘The Relational Model for Database Management’ of 1990 (version 2, 1991) brings together the ideas set out in his previous papers on a relational data model for managing databases. In it he places his model as solidly based on two areas of mathematics: predicate logic and relational theory. In order for the mathematics to work effectively, there are four essential concepts associated with the relational model: domains, primary keys, foreign keys and no duplicate rows. In particular, the importance of domains has not been fully understood or adopted by later commercial versions of his RDBMS (Codd 1991, p. 18). Also, two early prototypes, IBM’s System R and Berkeley’s INGRES (led by Michael Stonebraker), were not concerned with the need to address the issue of duplicated rows. The designers of both systems felt that the additional processing required to eliminate duplicate rows was unnecessary given their relatively benign presence (Codd, 1991, p. 18). Codd’s purer model based on mathematical principles gave way to the more pragmatic needs of the commercial world.

2.2.2 Main Features of ‘true’ RDBMS

The main features of a Relational DBMS as proposed by Codd distinguish a ‘true’ Relational DBMS from other DBMS’. Based on his earlier paper setting out his 12 rules (1985), they are summarised as follows:

• Database information is values only and ordering is not essential (meta-data, while required, should not be of concern to the everyday user; pointers are not used).

• Data management is not dependent on position within the structure (contrast with the hierarchical and network models).

• Duplicate rows are not allowed.

• Information should be capable of being moved without impact on the user.

• Three-level architecture of the RDBMS - base relations, storage, views (derived tables).

• Declarations of domains as extended data types.


• Column description should be akin to the domain it belongs to (i.e. a good naming

convention).

• Each base relation (R-Table) should have one and only one primary key column,

where null value entries are not allowed.

• The RDBMS must allow one or more columns to be assigned as foreign keys.

• Relationships are based on comparing values from common domains.

This last point is crucial to understanding Codd’s intention. Only values from common domains can be properly compared - currency with currency, euro with euro, date with date, integer with integer, etc. The basis for this lies in the nature of the mathematical operators used in the system. Consistency of data types and strict rules are therefore vital for the effective operation of the system. Herein lies one of the difficulties presented to designers of commercial versions of Codd’s RDBMS. Users of data management systems are presented with real world scenarios where consistency is not always practical. It would be ridiculous to ask members of a social networking site to use standard forms for communicating so that the DBMS could store the relevant information appropriately. Even closer to the relational database world, a transaction record could be created for a person called William Thomas as follows:

Instance  Surname  Forename    Address               DOB         ID    Order No

1         Thomas   William     22, Greenview Street  12/06/1945  1234  104
2         Thomas   Bill        22 Greenview St.      12/06/1945  1365  104
3         Thomas   William H.  22, Greenview Street  12/06/1945  3456  104

Table 2.2 - Example of redundant rows in a database

As can be seen in this simple example above, the database treats these as three distinct and

unique records, even though the intention is that only one record for this person should exist.
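The problem can be demonstrated in a few lines of code. Compared field by field, the rows are simply different tuples, so the system has no basis for treating them as one person. The values below are those of Table 2.2; the comparison logic itself is only an illustration of exact-value equality, not of any particular DBMS.

```python
# The three variants of the same customer, as a relational engine sees
# them: distinct tuples. Equality is exact value comparison, so nothing
# flags them as duplicates of one another.
rows = [
    ("Thomas", "William",    "22, Greenview Street", "12/06/1945", 1234, 104),
    ("Thomas", "Bill",       "22 Greenview St.",     "12/06/1945", 1365, 104),
    ("Thomas", "William H.", "22, Greenview Street", "12/06/1945", 3456, 104),
]
print(len(set(rows)))  # 3 - all "unique", though one person is intended
```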

The result impacts on the size, processing speed and integrity of the system. Techniques to address such problems (primarily data normalisation) were developed almost from the beginning, in the early 1970’s, by Codd and later by Raymond Boyce and Codd (Elmasri and Navathe, 1989, p. 371). Database normalisation is beyond the scope of this dissertation; however, the salient point (and the reason for our initial hypothesis) is that the nature and volume of unstructured data flowing through the electronic ether has pushed the RDBMS and its associated control and optimisation processes to the limits of their capabilities.

Debashish Ghosh of Anshin Software, while advocating the merits of non-relational models, nevertheless puts it fairly:

“A relational data management system (RDBMS) engine is the right tool for handling

relational data used in transactions requiring atomicity, consistency, isolation, and

durability (ACID). However, an RDBMS isn’t an ideal platform for modelling

complicated social data networks that involve huge volumes, network partitioning,

and replication”. (Ghosh, 2010)

The above discussion is intended to draw an important distinction between Edgar Codd’s original theory of a relational data management system and subsequent versions developed for the commercial enterprise market (the mainframe computer market at that time). The importance of the mathematical principles (relational algebra and calculus) behind Codd’s ideas is not underestimated, nor are the associated operations based upon those principles; in fact they are key to understanding why Codd at the time persisted in pushing for a full and true implementation of his model, and it may also explain why he stepped back from the first experiments in commercialising his ideas (Chamberlin and Blasgen in McJones, 1997 p. 13). Brevity here forces us to move on to look at two of the earliest commercial versions of the RDBMS that, by no accident, are also the two market leaders today.

As an aside, Appendix 1 presents a useful comparison of the key terms from Codd’s original intended meaning and their relationship to other systems.

2.2.3 IBM, Ellison and the University of California, Berkeley

IBM

One artefact cited several times in this section on the history of data management systems is a transcript of a 1995 reunion meeting of some of the original IBM research employees who, during the 1970s and 1980s, were at the coal face of data management development. The article, edited by Paul McJones, is entitled “The 1995 SQL Reunion: People, Projects, and Politics” (McJones, 1997). What at first seems like the convivial reminiscences of middle-aged ex-IBM colleagues in fact turns out to be a rather more interesting illumination of the context around the timelines for the development of some of the most important ideas to emerge, as well as the historically important players and products, from the realm of database management. Some of the key people attending the reunion and contributing to the discussion are: Donald Chamberlin, Jim Gray, Raymond Lorie, Gianfranco Putzolu, Patricia Selinger, and Irving Traiger. All are IBM and ACM Fellows and award winners for their work. Jim Gray, a fellow Berkeley graduate and mentor to Michael Stonebraker, was given the ACM Turing Award in 1998 for his work on transaction processing (ACID) (Stonebraker, 2008). Patricia Selinger was awarded the SIGMOD Edgar F. Codd Innovations Award for her work in query optimisation. Their contributions were vital to the features of commercial RDBMS’ which have ensured their longevity thus far and possibly for many years yet.

IBM and System R

Midway through the 1970s IBM’s San Jose research lab began working on a project called System R. Like many IBM research projects at the time, it came out of different task groups working on related areas such as data language, data storage, optimisation, concurrent users, and system recovery. System R was relationally based and combined work from the various groups. System R, as a commercial RDBMS, was installed at the Pratt & Whitney Aircraft Company in Hartford, Connecticut in 1977, where it was used for inventory control. However, IBM was not yet interested in releasing it as a fully featured product. At that time the big IBM cash cow was IMS (its mainframe hierarchical DBMS mentioned earlier), and the research focus was on a project called Eagle - a replacement for IMS with all the new features of recent discoveries. With the pressure off, the System R developers plugged away, aiming it towards the lower midrange product line (Jolls in McJones, 1997, p. 31). Two things then happened which brought the focus back onto System R and getting it ready for market (McJones, 1997, pp. 33-34). Firstly, IBM was starting to lose ground to new mini computers (Gray in McJones, 1997, p. 20), and secondly the Eagle project was hitting a wall. System R, unlike Eagle, was relational and already pitched towards the smaller computer range. The System R star did not shine for long and it was replaced by DB2, with Release 1 in 1980. IBM fully embraced the relational DBMS with Release 2 around 1985 (Miller in McJones 1997, p. 43). DB2 is IBM’s current offering and is mentioned again under the section on the RDBMS market.


The Birth of SQL

At around the same time that System R was being developed, the language research team at IBM, Relational Data Systems (RDS), took on Codd’s two mathematically based languages for data management, relational algebra and relational calculus. By their own admission they found these mathematical notations too abstract and complex for general use, and so developed a notation which they called SQUARE (Specifying Queries As Relational Expressions) (Chamberlin in McJones, 1997 p. 11).

SQUARE had some odd subscripts, so a regular keyboard could not be used. RDS developed it further to be closer to common English words, calling the new version Structured English Query Language, or SEQUEL. The intention was to make interaction with databases easier for non-programmers. However, its biggest impact came later, when Larry Ellison (co-founder and CEO of Oracle) read the IBM published papers on SEQUEL and realised that this query language could act as an intermediary between different systems (Chamberlin in McJones, 1997 p. 15). It was the RDS team at IBM who renamed it SQL, following a trademark challenge to the term SEQUEL from an aircraft company (McJones, 1997, p. 20).

INGRES

In parallel with the work going on at IBM, the University of California at Berkeley had a project developing a system called INGRES (short for Interactive Graphics Retrieval System). Michael Stonebraker, who was at Berkeley in 1972, was developing a query language called QUEL. Stonebraker knew fellow Berkeley graduates at IBM San Jose and, more importantly, knew of their work. INGRES used QUEL, whereas IBM and Larry Ellison’s project at Software Development Laboratories (later Oracle) used SQL. Subsequent offspring of the INGRES family are Sybase and Postgres (post-Ingres). Incidentally, Microsoft struck a deal with Sybase to use their code for their new extended operating system. The Sybase people had been brought up in the QUEL tradition under Stonebraker, whereas Microsoft preferred SQL. They eventually fell out, and Microsoft, which now owned the Sybase code, ended up developing Microsoft SQL Server (Gray in McJones, 1997 p. 56).


Oracle

In 1977 Larry Ellison, Bob Miner and Ed Oates founded Software Development Laboratories (SDL), the precursor to Oracle Corporation. SDL based its system on a technical paper in an IBM journal (Oracle History, 2011) - Edgar Codd’s 1970 seminal paper setting out his model for an RDBMS (Traiger in McJones, 1997). SDL’s first contract was to develop a database management system for the Central Intelligence Agency (CIA); the project was called ‘Oracle’. SDL finished that project a year early and used the time to develop a commercial RDBMS, putting together the work done by IBM research on relational databases and, as mentioned above, on the query language SEQUEL. While Ellison and SDL benefited from the work done at IBM, they still had to do all the coding. The resulting product was faster and a lot smaller than IBM’s System R. The first officially released version of Oracle was version 2, in 1979.

Brad Wade jokes about Edgar Codd’s influence on Oracle - on Codd being made an IBM Fellow in 1976: “It’s the first time that I recall of someone being made an IBM Fellow for someone else’s product” (Wade in McJones, 1997, p. 49).

It appears that many new enterprises sprang from the well of knowledge existing at IBM during the 1970’s and 1980’s. Had the IBM research units not had so much talent, or not allowed publication of key papers at the time, the database world might look very different today. Software patents were prohibited by IBM policy, and in fact by US Supreme Court rulings, until 1980 (Bocchino, 1995). According to Franco Putzolu, IBM Research at that time, and up until 1979, was “publishing everything that would come to mind” (in McJones, 1997, p. 16). Mike Blasgen argues that the outside interest in the published research was one reason why the corporate machine of IBM began to notice some of the lesser research projects (in McJones, 1997 p. 16).

It is hoped that the above overview gives the reader some understanding of the related threads that developed out of Charles Bachman’s initial work on data management systems, through IBM via Edgar Codd, and out into the wide world via the IBM research department’s open attitude to sharing knowledge, from which Larry Ellison’s Oracle benefited greatly. Berkeley also played its role, providing a common alma mater where young, enthusiastic developers could discuss ideas. It is an interesting irony that when we think of ‘open source’ we envision a recent phenomenon; however, IBM during the 1970’s would appear to have been a little more open, for whatever reasons, than is usually credited to them.

2.3 New Databases

This section will explore the development of the new databases that have emerged on the market over the past decade, and the impact these databases will have on the database market as a whole.

What are ‘New DBs’?

Traditional databases rely on a relational model in order to function; that is, they follow a set of rigid rules to ensure the integrity of the data in the database. Most RDBMS models follow the set of rules originally outlined by Edgar Codd (1970).

New NoSQL database models do not follow all of the rules set down by Codd. While RDBMS models follow the set of properties called ACID, as previously stated, NoSQL database models do not. They instead follow properties such as BASE (Basically Available, Soft state, Eventual consistency) (Cattell, 2011) and trade off the guarantees described by CAP (Consistency, Availability and Partition tolerance).

Why were NoSQL model databases developed?

The development of NoSQL databases was a result of the evolution of the World Wide Web, and the desire of individuals and organizations to generate data - large amounts of it (White, 2010, p. 2). Having collected that data, organizations then had to extract value from it in order to be successful in whatever field they participated.

The problems organizations faced in extracting value from that data were twofold:

1. As storage capacities increased, the means of transferring data to and from the drive(s) did not keep up. Twenty years ago, a hard drive could store 1.3 GB of data, and the speed at which the entirety of that data could be read was 4.4 MB per second - about five minutes to access it all. Today, 1 TB hard drives are the norm, but access speeds are about 100 MB per second, so reading a full drive takes hours; relative to capacity, access speed has fallen by a factor of about 30 (White, 2010, p. 3).

A means of getting around this bottleneck was the introduction of disk arrays, whereby data could be written to and read from multiple disks in parallel. The drawback was the increased possibility of hardware failure, whereby a disk or machine would fail and the data be lost (White, 2010, p. 3). Redundancy (the various levels of RAID being the most famous examples) solved some of these problems, but not all (Patterson, 1988).

2. The second problem is that, with multiple disks, relational database models, with their inbuilt consistency requirements, are unable to access data quickly enough when the data is spread across multiple disk drives. An RDBMS may not allow a query to access certain data if that data is already in use by another program or user (Chamberlin, 1976).

2.3.1 Features of NoSQL Databases

For a database to be considered a NoSQL database, it first must not comply with the entirety of the ACID properties. The features that define NoSQL databases include scalability, eventual consistency and low latency (Dimitrov, 2010). A key feature of NoSQL databases is a “shared-nothing” architecture. This means databases can replicate and partition data across multiple servers. In turn, this allows the databases to support a large number of simple read/write operations per second (Cattell, 2011).

Scalability

With traditional RDBMS systems, a database was usually required to scale up - that is, switch over to a newer, larger capacity machine - if it was to expand capacity (Cattell, 2011).

One of the features designed into some NoSQL databases is their ability to scale to large data

volumes without losing the integrity of the data. With NoSQL, as systems are required to

expand with an influx of additional data, they scale out by adding more machines to the data


cluster. With this scaling, NoSQL systems can process data at a faster speed than RDBMS, as

they are capable of spreading the workload of the processing over numerous machines

(Cattell, 2011).

Eventual Consistency

Eventual Consistency was pioneered by Amazon using the Dynamo database. The purpose of

its introduction was to ensure High Availability (HA) and scalability of the data. Ultimately,

data that is fetched for a query is not guaranteed to be up-to-date, but all updates to the data

are guaranteed to be propagated to all copies of the data on all nodes of the cluster eventually

(Cattell, 2011).

This ensures that databases are accessible to programs and individuals who wish to read or modify data, without the constraint of being locked out of a database or data field while the data is being updated or read, as is the case with RDBMS database models.

Low Latency

Latency is an element of the speed of a network. It refers to any number of delays that typically occur in the processing of data (Mitchell, no date). In the case of NoSQL databases, queries can access the data and return answers more quickly than an RDBMS because the data is distributed across multiple nodes of a cluster instead of one machine, resulting in a faster response time. Causes of high latency in traditional RDBMS model databases include the seek time of hard disks (Mitchell, no date), the speed of the network cables connecting the machines, and poorly written queries (Stevens, 2004) (Souders, 2009).

NoSQL database models

Unlike the relational model, NoSQL data models vary widely. For storage purposes, NoSQL databases fall into a number of data model categories, which are listed below:

Key-value Stores

Databases that follow this model use a single key-value index for all the data. These systems provide persistence mechanisms as well as additional functions such as replication, locking, transactions and sorting. NoSQL databases such as Voldemort and Riak use Multi-Version Concurrency Control (MVCC) for updates. They update data asynchronously, so they cannot guarantee consistent data (Cattell, 2011).

Key-value store databases can support traditional SQL-like functionality, such as insert, delete and lookup operations (Cattell, 2011).
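The single-index model above can be sketched as a toy class. The interface names below are invented for illustration; real systems such as Voldemort and Riak add replication, versioning and persistence on top of this basic shape.

```python
# A toy key-value store: every operation addresses a value only through
# its key, which is what makes the model trivially easy to partition
# across machines (hash the key, pick a node).

class KVStore:
    def __init__(self):
        self._data = {}

    def put(self, key, value):
        self._data[key] = value

    def get(self, key, default=None):
        return self._data.get(key, default)

    def delete(self, key):
        self._data.pop(key, None)

store = KVStore()
store.put("user:1234", {"surname": "Thomas", "forename": "William"})
print(store.get("user:1234"))
store.delete("user:1234")
print(store.get("user:1234"))  # None: key no longer present
```

Note what is missing compared with an RDBMS: there is no query over values, no join, and no schema - only the insert, delete and lookup operations described above.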

Document Stores

This model supports more complex data than key-value stores. Document stores can support secondary indexes and multiple types of documents per database; databases using this model include Amazon’s SimpleDB and CouchDB.

Document store databases provide a querying mechanism for the data they contain using multiple attribute values and constraints (Cattell, 2011).

Extensible Record Stores

Influenced by Google’s Bigtable, extensible record store databases consist of rows and columns, which are scaled across multiple nodes. Rows are split across nodes by ‘sharding’ on the primary key, which means that querying a range of values does not have to go to every node. Columns are distributed over multiple nodes by using ‘column groups’. These allow the database customer to specify which columns are best stored together, with the added advantage that such groups can be queried faster, as the most relevant data for a query is likely to be close at hand: e.g., name and address (Cattell, 2011).

The most famous examples of extensible record store databases available, save Google’s proprietary Bigtable, are HBase and Cassandra. Additional databases that use the model are Hypertable, sponsored by Baidu (Hypertable, 2011), and PNUTS (Yahoo Research, 2011).


Graph Databases

A graph database maintains one single structure - a graph (Rodriguez, 2010). A graph is a flexible data structure that allows for a more agile and rapid style of development (Neo4J, 2011).

A graph database has three main elements:

1. Node - an entity in the graph, in which a data item is stored.

2. Relationship - a labelled connection that determines which data, in the same or another node, the original data is related to.

3. Property - an attribute of the data. (Neubauer, 2010)

The purpose of graph databases is to quickly determine the relationships between different items of data. Examples of graph databases include the Neo4j database and Twitter’s FlockDB, which is used to connect tweets between those who post them and all of their followers (Weil, 2010).
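The node/relationship/property description above can be sketched as a minimal in-memory graph. The data and the ‘follows’ relationship are invented to echo the Twitter example; Neo4j and FlockDB expose far richer APIs than this.

```python
# A minimal graph: nodes carry properties, edges carry a relationship
# label. Finding who follows a user is a direct traversal of labelled
# edges rather than a relational join over tables.

nodes = {  # node id -> properties
    "alice": {"name": "Alice"},
    "bob":   {"name": "Bob"},
    "carol": {"name": "Carol"},
}
edges = [  # (from, relationship, to)
    ("bob",   "follows", "alice"),
    ("carol", "follows", "alice"),
]

def related(target, relationship):
    """All node ids with the given relationship pointing at target."""
    return sorted(src for src, rel, dst in edges
                  if rel == relationship and dst == target)

print(related("alice", "follows"))  # ['bob', 'carol']
```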

2.3.2 Hadoop

Hadoop/MapReduce

Hadoop is a distributed data storage and processing framework originally developed by Doug Cutting, later at Yahoo (White, 2010, p. 9), using Google’s proprietary MapReduce and Google File System designs as a model (Apache, 2011). Throughout its short history, developers have added components that allow Hadoop to process the data it stores more efficiently.

Hadoop contains a number of components that allow the system to scale to large clusters of machines without impacting the overall integrity of the data stored on those machines. The main component of Hadoop is MapReduce.

MapReduce is a framework for processing large datasets that are distributed across multiple nodes/servers. The ‘map’ part of the framework takes the original input data and partitions it, distributing the pieces to different nodes; individual nodes can then, if necessary, redistribute the data again to other sub-nodes. MapReduce applies the map function in parallel to every item in the dataset, producing a list of key-value pairs (White, 2010, p. 19). The ‘reduce’ part of the framework then collects all pairs sharing a common key, combines their values, and returns a single output per key. The reduce function, in effect, removes duplication within the system, allowing queries to return results more speedily (White, 2010, p. 19).

Hadoop is designed for distributed data, with a dataset split between multiple nodes, if

necessary. If MapReduce must query data that is located on multiple nodes, then the map

function will map all the data for the query that is located on a single node, and return the

result. It will do the same query on all nodes that the relevant data is located on. The reduce

function will then take all those map results and reduce them down to single values, again to

return the query result(s) (White, 2010, p. 31).

Both functions are oblivious to the size of the dataset that they are working on. As such, they

can remain the same irrespective of the size of the dataset, large or small. Additionally, if you

double the input data, a job will run twice as slowly; however, if you also double the size of the

cluster, the job will run as fast as the original one (White, 2010, p. 6).
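The two phases can be illustrated with the classic word-count example. The Python sketch below simulates both phases on a single machine; in a real Hadoop job the same map and reduce logic would be written against the MapReduce API, with partitioning and distribution handled by the framework:

```python
from collections import defaultdict

def map_phase(document):
    # 'map': emit a (key, value) pair for every word in one input split.
    return [(word, 1) for word in document.split()]

def reduce_phase(pairs):
    # 'reduce': collect all values that share a key and sum them,
    # returning a single output per key.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

# Three input splits, standing in for data spread across three nodes.
splits = ["the quick fox", "the lazy dog", "the fox"]

# Apply the map function (conceptually in parallel) to every split,
# then reduce the combined list of pairs to one value per key.
pairs = [pair for split in splits for pair in map_phase(split)]
counts = reduce_phase(pairs)
print(counts["the"])  # -> 3
```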

HDFS

HDFS is the file system that allows Hadoop to distribute data across multiple

nodes/machines. HDFS stores data in blocks, similar in fashion to other file systems. However,

while other file systems have small blocks, HDFS by default has large blocks (64 MB). This

reduces the number of seeks that Hadoop must make in order to return a query, speeding

up the process (White, 2010, p. 43).
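A rough calculation illustrates why large blocks reduce seeks. Assuming one seek per block and a 1 TB file (both figures chosen purely for illustration), reading the file through 64 MB HDFS blocks requires far fewer seeks than through a typical 4 KB local file-system block:

```python
TB = 1024 ** 4
file_size = 1 * TB             # an illustrative 1 TB file

hdfs_block = 64 * 1024 ** 2    # 64 MB, the HDFS default of this era
local_block = 4 * 1024         # 4 KB, a common local file-system block

# Assuming one seek per block, the number of seeks to read the file:
print(file_size // local_block)  # -> 268435456
print(file_size // hdfs_block)   # -> 16384
```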

2.3.2.1 Components of Hadoop

HBase

Based on Google’s Bigtable, HBase was developed by Chad Walters and Jim Kellerman at

Powerset. The purpose of HBase was to give Hadoop a means of storing

large quantities of data in a fault-tolerant manner. It can also sit on top of Amazon’s Simple Storage

Service (S3) (Wilson, 2009). HBase was developed from the ground up to allow databases to


scale just by adding more nodes – machines – to the cluster that HBase/Hadoop is installed

on. As it does not support SQL, it can do what an RDBMS cannot: host data in

sparsely populated tables, located on clusters made from commodity hardware (White, 2010,

p. 411). The structure of HBase is designed with a ‘master node’, which has control of any

number of ‘slave nodes’, called Region Servers. The master node is responsible for assigning

regions of the data to the region servers, as well as being responsible for the recovery of data

in the event of a region server failing (White, 2010, p. 413). In addition to this setup, HBase

is designed with fault tolerance built in – HBase, thanks to HDFS, creates three different

copies of the data spread across different data nodes (Dimitrov, 2010).
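The sparsely populated tables described above can be pictured as a map of maps: each row stores only the columns it actually has, so empty cells cost nothing. A minimal Python sketch follows (the row keys and column names are invented for illustration):

```python
# A sparse table as a dict of dicts: row key -> {column -> value}.
# Unlike a fixed relational schema, rows need not share columns,
# and absent cells consume no storage at all.
table = {
    "row1": {"cf:name": "Alice", "cf:email": "alice@example.com"},
    "row2": {"cf:name": "Bob"},                 # no email column at all
    "row3": {"cf:phone": "+353-1-555-0000"},    # entirely different columns
}

def get(row, column, default=None):
    """Look up a single cell; missing rows or columns yield the default."""
    return table.get(row, {}).get(column, default)

print(get("row2", "cf:email"))  # -> None (the cell simply does not exist)
```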

Hive

Hive is a scalable data processing platform developed by Jeff Hammerbacher at Facebook

(White, 2010, p. 365). The purpose of Hive is to allow individuals who have strong SQL

skills to run queries on data that is stored in HDFS.

When querying the dataset, Hive first tries to convert SQL queries into MapReduce jobs, as

well as custom commands that allow it to target different partitions within the HDFS dataset,

allowing users to query specific data within the Hadoop cluster (White, 2010, p. 514). This

allows Hive to provide users with a traditional query model from older RDBMS

environments within the newer distributed NoSQL database environments.

2.3.3 Cassandra

Cassandra is a fault tolerant, decentralised database that can be scaled and distributed across

multiple nodes (Apache, 2011; Lakshman, 2008). Developed by Avinash Lakshman at

Facebook (Lakshman, 2008), Cassandra is now an open source project run by the Apache

Foundation (Apache, 2011).

Initially designed to solve a search indexing problem, Cassandra was designed to scale to

very large sizes across multiple commodity servers. Additionally, the ability to have no single

point of failure was built into the system (Lakshman, 2008). Since Cassandra was designed to

scale across multiple servers, it had to overcome the possibility of failure at any given

location within each server, such as the possibility of a drive failure.


To guard against such a possibility, Cassandra was developed with the following functions:

Replication

Cassandra replicates data across different nodes when it is written. When data is

requested, the system accesses the closest node that contains the data. This ensures

that data stored using Cassandra maintains High Availability (HA), one of the core

attributes of a NoSQL database. Once data is written to a server, a duplicate copy of

the data is then written to another node within the database (Lakshman, 2008).

Eventual Consistency

Cassandra uses BASE to determine the consistency of the database. In order for data

to be accessible to users, an individual who is reading the data accesses it on one

node. At the same time, another individual can be making changes to another copy of

the data on another node. Because the data is replicated, newer versions of the data may be

sitting on one node while older versions are still active on other nodes (Apache wiki,

2011).

Users of Cassandra can also determine the level of consistency, allowing writes to add

or edit data to a single copy of the data in a node, or, if possible, to write to all copies

of the data across all nodes (Apache wiki, 2011).
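The replication and consistency behaviour described above can be simulated in a few lines. The sketch below is plain Python, not Cassandra client code; it models three replicas, a write acknowledged at a consistency level of one, and the stale read that is possible until the newer version propagates:

```python
# Three replicas of the same record, initially consistent.
replicas = [{"v": 1}, {"v": 1}, {"v": 1}]

def write(value, consistency):
    """Write to `consistency` replicas before acknowledging (e.g. ONE vs ALL)."""
    for replica in replicas[:consistency]:
        replica["v"] = value

def read(node):
    """Read whatever version the chosen replica currently holds."""
    return replicas[node]["v"]

write(2, consistency=1)        # acknowledged after a single replica
print(read(0), read(2))        # -> 2 1  (node 2 still serves the old value)

# Background replication eventually propagates the newer version.
write(2, consistency=len(replicas))
print(read(2))                 # -> 2
```

The stale intermediate read is exactly the trade-off BASE accepts in exchange for availability.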

Scalability

Data that is stored on Cassandra is scalable across multiple machines. Such elasticity

is possible because Cassandra allows the adding of additional machines to the cluster

when required (Apache, 2011).


2.4 The market for RDBMSs and Non-Relational DBMSs

2.4.1 Introduction

This section gives an overview of the current market for both relational databases and

newer non-relational databases. It investigates traditional vendor

database offerings, as well as the community-developed open source database offerings

that have proliferated over the past few years.

The literature review for determining the current market for both traditional relational

databases and ‘future’ non-relational databases utilised a variety of sources, including

Internet search queries to find relevant research material and the University of

Dublin (DU) library facilities to access academic and commercial research to which DU

has access.

2.4.2 RDBMS Market

Today, many executives want business to grow based on data-driven decisions. As such,

analytics of data has become a valuable tool in Business Intelligence (BI). Many of the top

performing companies use analytics to formulate future strategies and guide them on the

implementation of day-to-day operations (LaValle et al, 2010). However, organisations are

gaining more and more data without the means of extracting value from that data (LaValle

et al, 2010). This has driven companies to adopt

enterprise solutions that can give an overview of the data being generated, using Online

Analytical Processing (OLAP) databases.

The Database Management Systems market is split into two segments: OnLine Transaction

Processing (OLTP) and OLAP / Data Warehousing (DW). The RDBMS options available

from vendors in the market will generally target one of

these two segments.

The OLTP market targets clients that require fast query processing, maintaining of data

integrity in multi-access environments and a business model that has data measured by the

number of transactions per second that the database can handle. In an OLTP model database,


there is an emphasis on detailed current data, with the schema used to store the data being an

entity model normalised to Boyce-Codd Normal Form (BCNF) (Datawarehouse4u, 2009).

OLAP databases are characterised by a low volume of transactions, and are primarily

designed for data warehousing. As such, they are particularly useful for data

mining, whereby applications access the data to give an overview of current trends, business

performance and informational advantage. OLAP databases are therefore increasingly seen as

important for making Business Intelligence (BI) decisions (Feinberg, Beyer, 2010).

2.4.2.1 Vendor Offerings

The enterprise database market is dominated by a few big corporations,

which include Oracle, IBM, Microsoft, Sybase and Teradata. Many of the database offerings

from these firms operate in the Data Warehousing sector, which contains most of the market

for enterprise database management systems. While the big players have comprehensive

database offerings for their clients, the market is currently being disrupted by new entrants

who are targeting niche areas, focusing either on performance issues related to their

offerings or on single-point offerings (Feinberg, Beyer, 2010).

Oracle

According to Gartner, Oracle is currently the No. 1 vendor of RDBMS’ worldwide (Gartner

in Graham et al, 2010), with a 50% share of the market for the year 2010 (Trefis, 2011). They

are forecast to improve this figure to 60% by 2016, driven by their sales of the Exadata

hardware platform. Leveraging the use of the high-end Exadata servers in conjunction with

Oracle’s database software is estimated to result in more efficient and faster Online

Transaction Processing (Graham et al, 2010).

Currently, Oracle generates 86% of revenues from its database software portfolio, with 8%

from its hardware portfolio. The future strategy of the company is to have clients purchase

complete systems – hardware and software – thus leveraging the power of the Exadata system

to get the most out of Oracle’s database technology. The result will be an increase in Oracle’s

revenues and its market share (Crane et al, 2011).


IBM

IBM is one of the main vendors in the market, and is the only vendor that offers to its clients

an Information Architecture (IA) that spans all systems, which includes OLTP, DW, and

retirement of data (Optim tapes) (Henschen, 2011a). IBM’s main offering in the RDBMS

market is the DB2 database. DB2 runs on a number of platforms, including Unix, Linux and

Windows OS. DB2 can also run on the z/OS platform, where it is used to deploy applications

for SOA, CRM, DW and operational BI.

IBM’s RDBMS solutions are ranked No. 2 worldwide behind Oracle (Finkle, 2008); however,

IBM is slowly losing market share to Microsoft and Oracle due to uncompetitive pricing for

its database, as well as the greater functionality that can be found in rival offerings.

Recently, IBM acquired Netezza (Evans, 2011), a company that provides a DW appliance

called TwinFin to clients. TwinFin is a purpose-built appliance that integrates servers, storage

and database into a single managed system (Netezza, 2011a). The reason IBM acquired

Netezza is the expected increase in revenues that Netezza will generate from its portfolio

(Dignan, 2010), as well as a lack of overlap in the customer base between IBM’s current

client list and that of Netezza (Henschen, 2011b). Additionally, the acquisition fits in with

IBM’s overall business analytics strategy, as IBM has marked BI as the key driver for IT

infrastructure needs (Gartner, 2010).

Microsoft

SQL Server from Microsoft is a complete database platform designed for applications of

various sizes. It can be deployed on normal servers as well as in the ‘cloud’, allowing

clients to scale SQL Server to their respective needs. Purely a software player, Microsoft

requires hardware partners to deploy its database offerings (Mackie, 2011).

Microsoft, however, finds itself more under threat from low-cost or ‘free’ open source

alternatives such as MySQL and PostgreSQL, due to operating primarily in the low-end and

mid-market segments (Finkle, 2008). As such, if its clients are evaluating alternative options, SQL

Server may not be priced competitively enough to compete with open source RDBMSs.


SAP/Sybase

Sybase, recently acquired by SAP, has three main business areas: OLTP using the Sybase

ASE database, Analytic Technology using Sybase IQ, and, interestingly, Mobile Technology

(Monash, 2010). This deal was required by SAP as it was coming under increasing pressure

due to Oracle’s recent acquisition of Sun Microsystems, which gave Oracle a stronger focus

on integrated products based around databases, middleware and applications (Yuhanna,

2010).

The deal between SAP and Sybase gives both companies a lot of synergies – SAP finally

acquires an enterprise-class database in the form of Sybase IQ, a database with a columnar

store and advanced compression capabilities that SAP can now offer to its hundreds of

client companies (Yuhanna, 2010).

The acquisition of Sybase also gives SAP a differentiator from its peers in the form of a

mobile offering. Sybase has a number of mobile products for enterprises, including the

Sybase Unwired Platform and iAnywhere Mobile Office suite. These technologies allow

companies to connect mobile devices to a number of back-end data sources (Sybase, 2011).

SAP now has the ability to offer its applications embedded in Sybase mobile platforms, using

the synergy between the two to improve its competitive advantage and expand to other

markets (Yuhanna, 2010). Indeed, efforts are now being made to cement Sybase’s lead in this

segment of the market, with an initiative to make the Android OS platform enterprise ready.

This involves porting Afaria, Sybase’s mobile device management and security solution, to

the Android platform (Neil, 2011). With the growth of Android now reaching 30% of the

smartphone market share in the United States (Warren, 2011), the future growth for Sybase in

the mobile enterprise market looks strong.

Finally, although big in the database market in the early 1990s (Greenbaum, 2010), Sybase

has been considered the fourth database vendor behind Oracle, Microsoft and IBM for the

past decade. The main market for Sybase’s OLTP offering, Sybase ASE, has been the financial

services sector, with little penetration in other enterprise sectors. It is expected that SAP will

make Sybase ASE more cost effective and make another push in this segment of the market,

perhaps at the expense of the big three (Yuhanna, 2010).


Teradata

Teradata is a database vendor specialising in data warehousing and analytical applications

(Prickett Morgan, 2010). During the last year, it was considered the best placed amongst its

peers as a market leader in Data Warehousing (Feinberg, Beyer, 2011). This will be a hard

position for competitors to dislodge as products in the DW market are considered difficult to

replace (Bylund, 2011). Amongst its clients are multinational corporations such as 3M and

PayPal (Teradata, 2011).

One of Teradata’s products, the Teradata parallel database, designed for DW and OLAP

functions, has an update and support revenue stream, as well as additional functions that

customers are willing to pay for (Prickett Morgan, 2010).

However, Teradata specialises in a single area of the database market – DW and analytics

(Prickett Morgan, 2010). As such it is exposed to any weakness that may occur within that

segment of the market. The company recently acquired Aprimo, an enterprise marketing

firm with a strong emphasis on Marketing Research Management (MRM) and Campaign

Management (CM). CM is considered by some as mission critical, as it allows marketers to

unlock the value of customer data to develop multi-channel communications. Such an

acquisition adds value to Teradata’s product portfolio, without competing with Teradata’s

current product range, allowing the company to diversify its offerings to clients and future

customers (Vittal, 2010).

EMC/Greenplum

Greenplum, a DW and Analytics firm acquired by EMC in 2010, is the foundation of EMC’s

Data Computing division. Greenplum specialises in DW in the ‘cloud’, through its Chorus

platform (Greenplum, 2011).

EMC’s strategy for gaining market share is to release a free community version of its

database for testing, with the intent that users eventually purchase a commercial licence. Its

recently released ‘free’ Community Edition database, a heavily customised version of

PostgreSQL, is targeted at companies and developers for whom Greenplum’s previous

offering was not useful for creating parallel databases for DW and Analytics (Prickett


Morgan, 2011). The purpose of the release is to allow developers to build and test Massive

Parallel Processing (MPP) databases. If clients who develop these systems

wish to use the software in a commercial environment, they will be required to purchase

a licence for the Greenplum Database 4.0, EMC’s commercial DW offering

(Kanaracus, 2011).

EMC hopes that customers wishing to have greater functionality with Greenplum’s

database will upgrade to the commercial Greenplum Database 4.0 (Kanaracus, 2011).

2.4.3 Non-RDBMS Market

Open Source Databases

There are a number of open source, community-developed database solutions available on the

market today. However, because these offerings are generally ‘free’, they do not rank

highly on lists of databases in use by revenue earned – yet total deployments of open source

databases can rival the total number of deployments from traditional vendors (Von Finck,

2009).

All RDBMS applications hold a consistency model that can be inflexible for certain

applications. The requirement for a record or table to be locked out from being viewed or

otherwise accessed while changes are being made slows down queries that are attempting to

generate results for end-users.

Additionally, due to atomicity and consistency, not all RDBMS applications are scalable to

the requirements of organisations that hold large quantities of data, such as Google and

Facebook.

With databases now employed that have tables of sizes in excess of 10 TB, the ability to

query all that data will require speed and processing power that cannot be achieved to the

requirements of user companies by traditional RDBMS offerings. Newer non-relational

database offerings designed to meet these new requirements usually come in two forms:

MPP systems and column-store databases (Henschen, 2010).


With the introduction of the Bigtable Distributed Storage System on top of the Google File

System (GFS) in 2006 (Chang, et al, 2006), Google has demonstrated that non-relational

databases can be scalable over multiple machines. Due to Bigtable’s proprietary nature

however, efforts have been made over the past five years to develop open source versions of

Google’s software, resulting in the arrival of the Apache Foundation’s Hadoop, initially

developed by Yahoo (Bryant and Kwan, 2008). A number of companies have now utilised

Hadoop and associated software to allow themselves to scale their database offerings to their

own requirements.

The growth of Hadoop can be inferred from unusual avenues. From 2007 through to early-to-mid

2009, IT job postings requiring expertise in Hadoop or MapReduce within the London area

accounted for 0.4% of the jobs market. By January 2011, the figure had grown to 1.2%, a threefold increase in

the requirement for this expertise within two years (IT Jobs Watch, 2011). Additionally, there was a

49% increase in Hadoop job postings in the United States from 2008 to 2009, with most of

the job offerings being in California (Lorica, 2009).

However, due to the lack of suitably qualified engineers for Hadoop and HBase within the

industry at present, development projects at a number of companies have been affected by

staff shortages. Within Silicon Valley, Google and Facebook are two companies that can

afford to remunerate staff competitively due to their large sources of revenue. This has

resulted in Cloudera, the start-up cloud database company, being unable to offer top

engineers remuneration at similar levels to their competitors. Cloudera has had to be

imaginative in how it remunerates staff. This includes setting up offices in

downtown San Francisco, with the intention that staff would prefer to work in that location

rather than Palo Alto or Mountain View, both 30 miles from the centre of San Francisco (Metz,

2011a).

Such constraints will result in a lack of projects for new NoSQL databases until an adequate

supply of qualified engineers becomes available, slowing the growth of development and

adoption of this new technology for the foreseeable future.


Cassandra

Cassandra is a distributed, column family database, developed at Facebook to solve an Inbox

Search problem (Lakshman, 2008). It is now an open sourced project from the Apache

Foundation (Apache, 2011).

In addition to Facebook, additional users of the Cassandra database include the social news

website Digg (Higginbotham, 2010), who decided to switch from MySQL to Cassandra due

to scalability issues with MySQL. The rationale behind the move was the decentralised nature

of Cassandra and the fact that it has no single point of failure (Kerner, 2010). Unfortunately,

the changeover to Cassandra did not run smoothly, resulting in Digg having to revert to

MySQL to ensure data integrity and keep its services available to its clients. The

episode highlighted the pitfalls of switching from one architecture framework to another

(Woods, 2010).

Taking advantage of Cassandra’s introduction to the market is Datastax – formerly Riptano

(DBMS2, 2011) – a start-up founded by the Cassandra project’s chair, Jonathan Ellis. The

purpose of Datastax is to take commercial advantage of Cassandra, by selling expertise and

technical support in Cassandra (Kerner, 2010), following the examples of Red Hat (Linux)

and Cloudera (Cloud Computing) (Subramanian, 2010).

HBase

HBase is a non-relational database built on top of the Hadoop framework, using the Hadoop

Distributed File System (HDFS). Originally developed out of a need to process large amounts

of data, HBase is now a top-level Apache Foundation project (Zawodny, 2007).

Due to HBase’s ability to scale to large sizes, the database has received attention within IT as

a platform that can meet various companies’ requirements. Recent corporate announcements

about deployments of HBase have increased the marketplace viability of HBase as a

NoSQL database option (Metz, 2011b). These announcements include both Facebook and Yahoo, two

companies with large repositories of data.

Facebook announced a new messaging platform, in which email, text messages and Instant

Messages (IM), as well as Facebook’s own messaging system, would be integrated together

(Metz, 2010). Facebook experimented with a number of database offerings, including its own

Cassandra database, to see if any could handle the new system. Additionally, they excluded


MySQL due to scalability issues. Eventually, they chose HBase due to its consistency, as

well as its ability to scale across multiple machines (Muthukkaruppan, 2010).

HBase was deployed by Yahoo to handle its news aggregation algorithm. The purpose of the

new system is to data-mine content in order to optimise what the viewer sees on Yahoo’s web

portal. In order to deploy the most relevant news stories to the website’s front page, Yahoo

required a database that could query, in real time, the items that

people are most interested in at any given moment, based on the number of clicks each

story receives. Deployment of this new

system has resulted in an increase in traffic to the Yahoo web portal, and subsequently

resulted in an increase in revenues (Metz, 2008).


2.5 Case Studies

2.5.1 Case Study 1- Utility companies and the data management challenge

Introduction

Utility companies are known to be one of the most conservative of enterprises when it comes

to investing in technology (Fink, 2010; Fehrenbacher, 2010). There are many reasons for why

this might be so: security of supply, regulatory compliance and financial austerity, together

with a lack of business drivers, often leave the risk-averse utility treading water when it

comes to IT investment (Tony Giroti, CEO Bridge Energy, 2011). However, things have been

changing over the last few years. According to recent research by Lux, utilities (mainly

power and water) will invest up to $34 billion in technology by the year 2020 (St. John,

2011). The reason arises mainly from Smart Grid projects and the growing avalanche of

associated data which utilities will need to manage (St. John, 2011). For utilities, the business

drivers required to justify investment in the kind of technology which enables integration of

data across key business units have only recently emerged. Real-time applications just

weren’t necessary before now (Giroti, 2011).

Utilities

History has shown how utilities are by and large reactionary when it comes to new ideas. For

example, a snapshot of energy utilities related articles in the Pro Quest database (available

through the TCD Library’s online resources) at various times over the last few decades shows

flurries of activity around key moments of change in the industry. Cyclical changes from

regulation to de-regulation of the energy sector in the early 1990s, begun in the US, kick-

started reactionary strategy changes within the energy industry. Ireland followed the pattern

with the Electricity Regulation Act of 1999, a programme which is nearing completion. Fifty-six

articles on related subjects between 1992 and 1994, in contrast to just eighteen in the following

six years to 2000 (Pro Quest database), would seem to support this assertion.

In the last decade or so, innovation for utilities has centred on the technology enabling the Smart

Grid, and again an upsurge in articles on this subject stands out in a normally ‘steady state’


sector. More recently the pressures of a diminishing supply and subsequent higher prices of

raw material for energy production have propagated a sustainability drive.

Compliance however has been a steady influence on energy utilities. What makes the Smart

Grid attractive is the way it forces efficiency throughout the energy supply chain, from

generation to distribution, resulting in lower CO2 emissions – a major deliverable of the Kyoto

agreement. Related to this has been the drive towards sustainable energy generation and

supply. Vice President of Technology at Cobb Energy, Bob Arnett sums it up:

“In today’s world, where utilities are focused on environmental concerns, resource

constraints, and intelligent grids, it is sometimes hard to remember that in the mid-

Nineties, the word of the day was ‘deregulation’.”

(Arnett, 2011)

This case study looks at utility companies in the context of these three key drivers:

Regulation/Deregulation; Smart Grid and Sustainability. The case is stated in general terms

initially but quickly moves to more specific Smart Grid applications in electricity supply

companies, focusing on one Irish energy company’s use of databases in its implementation of

Smart Grid applications. As the ESB’s (Electricity Supply Board) Tom Geraghty said of

Smart Metering in a recent interview with Silicon Republic:

“How you get data back from the electronic metre to a utility central point where it is

aggregated and the bill is sent out to simply allowing people to top up their metre at

home as if it were a mobile phone shows you the complexity that lies ahead. There are

many imaginative options emerging and the opportunities are endless,”

(in Kennedy, 2011)

One estimation from Lux research puts the increase of data coming from the Smart Grid at

900% by 2020 (St. John, 2011). Tony Giroti puts this in more tangible terms: 1 million smart

meters passing data every 15 minutes equates to 30 TB of data per year to be handled, stored and

harvested (Giroti, 2011). This figure doesn’t include the real time data flowing through the

system as part of the self-healing attribute of Smart Grids.
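Giroti’s figure can be checked with a back-of-envelope calculation in Python. One million meters reporting every 15 minutes produce roughly 35 billion readings a year; for those to total 30 TB, each message would need to carry on the order of 850 bytes – the per-message size is our inference, not a figure from the source:

```python
meters = 1_000_000
readings_per_day = 24 * 4            # one reading every 15 minutes
readings_per_year = meters * readings_per_day * 365

TB = 10 ** 12                        # using decimal terabytes
bytes_per_reading = 30 * TB / readings_per_year

print(readings_per_year)             # -> 35040000000
print(round(bytes_per_reading))      # -> 856
```

A payload of several hundred bytes per message is plausible once protocol headers, timestamps and metadata are included, which suggests the 30 TB estimate is of the right order.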


The problem can be placed within the wider question asked in this dissertation, that is, what

is the future of the traditional RDBMS in the enterprise? To this end, this case study

posits that the general feeling towards newer database management solutions such as

open source and NoSQL is that while they are attractive for certain non-core applications,

they are not yet up to the task of the more serious mission critical functions of control

systems, financial transactions and customer management within enterprises. This study

investigates the problem in the context of traditionally risk averse utility companies and

questions if new business drivers (of which the Smart Grid is key) are forcing a rethink on

this issue.

A public utility company is an enterprise which provides key services to the public, most

typically electricity, gas, water, and transportation. It may be state or privately owned, and

may operate in a regulated, deregulated or even semi-regulated market (Legal Dictionary).

The energy sector in Ireland is currently undergoing dramatic change. The two largest

energy companies in Ireland, the Electricity Supply Board (ESB) and Bord Gáis, are

commercially run enterprises and are both majority owned by the state. Both companies have

recently entered into each other’s markets as a result of the state’s requirement (driven by

the EU) to open up the energy market in an attempt to improve the competitiveness of the sector

for the benefit of consumers (Irish Government White Paper, 2007).

One result of this restructuring of the sector is that the separate electricity and gas markets

have been combined and the sector is now generally referred to as the energy market. The

functions carried out by utility companies differ according to the services they provide.

Energy suppliers are similar in the functions they carry out such as generation, transmission

and distribution of energy. Water utilities in other countries have moved towards a revenue-

generating model for water supply, and Ireland, rightly or wrongly, may soon follow suit.

Each core function contains a number of supporting IT applications. Each of these in turn is

supported by a suitable data management system. Some of the major solutions used in energy

utilities include: Geographical Information System (GIS); Meter Data Management (MDM);

Customer Information System (CIS); Distribution Management System (DMS); Supervisory

Control and Data Acquisition (SCADA); and Outage Management System (OMS). Figure 2.3

shows where some of these systems fit into the overall network.


Each of these systems provides support for the specific needs of the different business functions,

such as supply, generation, distribution, trading, and operations. As such, they may or may

not be integrated. In relation to meter data management (MDM) Giroti again states the

problem succinctly in his paper entitled “You’ve Got the Meter Data – Now What?” (2011),

where he gives two options:

1. Have a proactive strategy for integrating and managing data coming from the Grid, or

2. Be reactive in response to problems as they appear, at the risk of being left behind by

competitors adopting the former strategy.

Smart Grid - The ESB case

The European Technology Platform definition of smart grids is -

“electricity networks that can intelligently integrate the behaviour and actions of all users

connected to it - generators, consumers and those that do both – in order to efficiently deliver

sustainable, economic and secure electricity supplies” (Smart Grids: European Technology

Platform, 2010)

Successful smart grid implementation depends on how enterprises utilise information systems

in managing the torrent of data heading their way. This issue puts data management systems

right back in the foreground of the IT game.

The ESB plans to invest up to €11 billion in sustainable projects including a Smart Grid

(Strategy Framework 2020). The ESB began a pilot project for advanced metering in 2007.

Advanced meters occupy what is termed the head end of the smart grid. They reside on

customer premises or at the company's own locations, typically at the edge of the distribution network. The ESB has to date installed 6,500 smart meters. The estimated total installations

required for full implementation is over two million. The data consists of messages to and

from a central management system called a meter data management system (MDM). The

messages can carry meter data relating to load readings, voltage and temperature measurements,

outages, faults and other events.

The ESB’s existing data management platforms include solutions from Oracle, IBM and Microsoft. Currently no open source or NoSQL solutions are in official use in the company. A preliminary evaluation of the open source database solution MySQL was carried


out by the IT department in 2010 but no decision on implementation has been made as yet.

MySQL is now under the Oracle roof following Oracle's acquisition of Sun Microsystems, completed in 2010 (Lohr, 2009).

Image source: http://www.consumerenergyreport.com/wpcontent/uploads/2010/04/smartgrid.jpg

Figure 2.2 – Overview of a generic Smart Grid


(Image source: EPRI)

Figure 2.3 - ESB proposed implementation of Advanced Metering (Key area of interest is circled)

The Data Volume Problem

A traditional electricity grid is made up of electro-mechanical components that link electricity

generation, transmission and distribution to consumers. A smart grid builds on advanced

digital SCADA devices involving two-way communication of data of interest to utilities,

consumers and government (Financial Times, Nov 2010).

Figures for how much data will flow vary depending on the implementation of the smart grid.

Estimates from the ESB’s trials involving 6,500 meters show a substantial increase in the

amount of data required to be stored and analysed at the back end.

Utilities, it seems, are not immune to ‘Big Data’. Tony Giroti is well placed to comment on the issue: he is one of only 13 elected members of the GridWise Architecture Council, formed by the US Department of Energy for the purpose of articulating the way forward for intelligent energy systems.

In his article for the e-magazine Electric Energy Online, “You’ve Got the Meter Data – Now What?” (2011), Giroti states the data volume problem as follows:


Figure 2.4 – Smart Meters transaction rate (1 million smart meters, one read every 15 minutes => approximately 1,111 transactions per second)

Giroti foresees the storage and processing concerns associated with this volume of data.

Figure 2.5 – Smart Meters data size (1 KB per transaction per meter => 1.1 MB/s; hourly collections => 3.6 GB of data per day to be stored, analysed and backed up)

Processing of this data also presents a challenge to system architects. Gathering of data from

a million smart-meters at 15-minute intervals as per the example above equates to 1,111

transactions per second, or 90 million transactions per day. The problem is further

compounded by the critical requirement of the system to analyse network event transactions

in real-time in responding to fluctuations in demand and fault response (Giroti, 2011).
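Giroti's headline rate can be checked with simple arithmetic. The sketch below (purely illustrative) reproduces the figures quoted above; note that the strict arithmetic gives 96 million reads per day, slightly above the article's rounded 90 million.

```python
# Back-of-the-envelope arithmetic for the smart-meter load figures
# quoted above (illustrative only; the 1 KB payload is Giroti's assumption).
METERS = 1_000_000           # projected full-scale deployment
INTERVAL_SECONDS = 15 * 60   # one read per meter every 15 minutes

reads_per_meter_per_day = (24 * 3600) // INTERVAL_SECONDS   # 96 reads/day
tx_per_second = METERS / INTERVAL_SECONDS                   # sustained rate
tx_per_day = METERS * reads_per_meter_per_day

print(round(tx_per_second))   # 1111 transactions per second
print(tx_per_day)             # 96000000 reads per day
```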

One limitation of Giroti’s claim is that there is no indication in the article of how the one-kilobyte-per-transaction figure is calculated. This is an important factor for vendors of back-end processing running on relational databases. The lower this number is, the better. Some

systems rely on filtering out less important data at the source, that is, at the meter itself rather

than storing superfluous data at the back end. For example, meter location information does

not change and can be sent only once. Even at a conservative data size of 128 bytes per transaction, for basic household usage data only at 15-minute intervals, that is roughly 12 kilobytes of data per meter per day (about 12 gigabytes per day across one million meters) to be stored, backed up and processed.

(Image source: Accenture, 2010)

Figure 2.6 - Sources of Smart Grid data with time dependencies

Figure 2.6 shows the different types of data involved. At one end you have critical, time-dependent event data. Some of the data at this end will have very low latency, measured in milliseconds, the kind of times involved in the safe operation of self-healing networks. At the top end there is the data for business intelligence. Processing of this data does not need to be immediate. The top end gets interesting, however, when the business tries to wade through its data warehouse, clustering data to form information and using the information for knowledge and, hopefully, wisdom. Then there is the middle layer of meter data coming in at 15-minute or half-hourly intervals. Efficient processing of this data is critical if utilities want to offer real-time billing to their customers.

Further sources of data can be found at the opposite end of the system, the home and car.

Successful interoperability between domestic devices, electric vehicles and supplier

equipment (data collectors) will create the intelligent home. Many commentators are focusing

on open source as the most viable platform for this development (Fehrenbacher, 2009;

Rosenberg, 2010).

So in the end, the business case for the return on investment in smart networks, and ultimately competitiveness, depends on how well all of this data is gathered, stored, processed and analysed.

Tom Geraghty, IT governance and strategy manager for the ESB, explained: “In terms of where the ESB is and where Ireland is in terms of the smart-grid agenda, we have just completed a year-long technical trial of smart metering and a decision is due mid-year from the energy regulator on how we should proceed.” Emphasising the IT challenges ahead, he added that “data collection, storage, transfer and billing are the key issues for utility firms” (in Kennedy, 2011).

How one utility company is meeting the data volume challenge

Dave Rosenberg of CNET interviewed Ritchie Carroll and Josh Patterson of Tennessee

Valley Authority (TVA) about their use of the open source solution Hadoop in addressing the

data volume problem (Rosenberg, 2009). TVA uses devices called Super Phasor Data

Concentrators (SuperPDC) to collect data on the health of their electricity network. TVA

expects the stored data from all their PDCs to grow to half a petabyte in the next few years.

TVA’s Josh Patterson says that “data is collected directly from field devices at 30 times per

second. This data is then time-aligned and processed in real-time….all data gets captured into

a binary data file as time-series data for mass processing by Hadoop.”

When Rosenberg asked why TVA chose Hadoop over more mature solutions, Patterson responded:

“We considered several technologies including SAN, NAS devices, and RDBM systems.

Hadoop gave us a commodity based hardware solution that offered superior reliability at a


minimal cost using HDFS, but it also had the added processing benefit using Map Reduce

over large scale data for fast analysis”

TVA’s Ritchie Carroll adds that the Hadoop techniques already developed allow for faster processing of ‘Big Data’ (Rosenberg, 2009).

What is the ESB doing?

ESB engineers working on their Advanced Metering project (the first step towards the Smart

Grid) looked for organisations who had implemented projects of similar scale. They found

that the US utility company Oncor Electric Delivery had planned for data from 3.5 million meters over a two-year period. This was a scale comparable to the ESB’s project at full implementation. Access to this information was valuable in assessing the right type of

data management system required.

As the ESB already uses well-established data management solutions from Oracle and IBM, an assessment of those vendors’ Meter Data Management (MDM) solutions seemed a good place to start.

Oracle Utilities Meter Data Management is described by Oracle as an ‘off the shelf’ solution for managing the influx of data from Advanced Metering Infrastructure (Oracle).

Oracle strengthened its position in the utilities area through its acquisition of Lodestar Corporation in 2007. The ESB Customer Supply business already uses Lodestar products for demand forecasting (PR Newswire, 2011). The Lodestar Customer Choice Suite includes the Oracle 10g database (Oracle, 2007).

Smart DTS is AMT Sybex’s MDM solution for the UK and Ireland utility sector; AMT Sybex claims unrivalled performance for processing large volumes of data (AMT Sybex). IBM’s Informix TimeSeries DataBlade is used as the enterprise-scale time-series RDBMS for meter data loads. In 2005 AMT Sybex provided products and services to the ESB Market Opening Project (MOIP) (AMT Sybex Case Study).

The ESB Power Generation business uses OSIsoft’s PI solution for managing real-time

operational information. The associated database is Microsoft SQL Server.


ESB also looked at Aclara Software’s Meter Data Management System. It uses a star schema

for time series data and a “wide storage” model to reduce storage needs and improve

processing.

The star schema is common in RDBMS data warehouse design. The schema is organised around a central fact table joined by foreign keys to dimension tables (Aclara Software Inc., 2008).

In “wide storage” the meter interval data is stored in a single daily record (row), with the attributes spread over many columns. In the more traditional “tall storage” database architecture, 15-minute interval data for one day would be stored in 96 separate rows. Aclara has estimated that “wide storage” will take up about 10% of the space “tall storage” would require for the same data (Aclara Software Inc., 2008). The wide-table database design was conceived to better handle sparse data sets in which many attributes may be null, such as user-provided information from the web (Chu et al, 2007).
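The contrast between the two layouts can be sketched as follows (a toy illustration; the table and column names are hypothetical, not Aclara's actual schema):

```python
# Toy contrast of "tall" vs "wide" interval storage. Record and column
# names here are hypothetical, not Aclara's actual schema.
INTERVALS_PER_DAY = 96  # 15-minute reads per day

# Tall storage: one row per meter per interval -> 96 rows per meter per day.
tall_rows = [
    {"meter_id": 42, "day": "2011-04-11", "interval": i, "kwh": 0.3}
    for i in range(INTERVALS_PER_DAY)
]

# Wide storage: one row per meter per day, with the interval readings
# spread across 96 columns of that single row.
wide_row = {"meter_id": 42, "day": "2011-04-11"}
wide_row.update({f"kwh_{i:02d}": 0.3 for i in range(INTERVALS_PER_DAY)})

print(len(tall_rows))                                   # 96 rows per day
print(sum(1 for k in wide_row if k.startswith("kwh")))  # 96 columns, 1 row
```

The saving comes from paying the per-row overhead (keys, row headers) once per day rather than 96 times.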

Conclusion

The ESB has not yet concluded trials and no decision on its preferred MDM system is

available at the time of writing. It is the opinion of the authors that the ESB will follow a similar line to that advised by Keith Broad (Director of Information Technology at Bluewater Power Distribution Corporation, Ontario, Canada): engage a trusted partner that will be around for a long time, one that will use its strong capabilities to develop product evolutions but can also implement at local level (Broad, 2011).

Another view comes from Forrester’s Jeffrey Hammond. He says that enterprises are now

less afraid of open source solutions. “For large companies its tech OS savvy people who just

want to solve a problem without the burden of procurement and licensing on their backs and

they have the time...for smaller co’s its just money” (Hammond, 2009).

Tennessee Valley Authority is one of the rare utilities to embrace open source and would seem to be the exception to the rule (different research methods, such as interviews and surveys, may reveal other utility companies using open source solutions for Smart Network applications; however, based on the inductive research so far, the number appears to be low).


For utilities that are regulated to any extent, decision making on investment is tightly

controlled. Network infrastructure upgrades can be slower and more cautious. As a result, large, well-established vendors are more attractive (Fehrenbacher, 2010).

Things may be changing, however. As initiatives to reduce costs while investing in smart

technologies dove-tail with the availability and uptake of technically robust and

commercially sound new offerings, open source and non-relational database management

platforms are making inroads into utility companies. It may still be a case of carefully matching systems with functional requirements, but open source, it seems, is becoming part of the business case for technology-savvy utilities. In reality, for utilities anyway, this might happen by proxy as their traditional and trusted vendors acquire the more robust of the open source solutions.

Andy Roehr of Capgemini sums up the importance of new technologies and methods for managing data:

“Just storing the data is also only the first step in gleaning benefits from smart meters and smart grids…the industry as a whole has a challenge in front of it as we learn how to use the data. Right now, we're trying to figure out what the right technologies are and appropriate data collection intervals. People are going to learn as they go.” (Pariseau, 2009)

2.5.2 Case Study 2 - Social Networks – The migration to NoSQL database models

The most active adopters of ‘Big Data’ NoSQL database technologies have been social networking websites, as these companies generate large amounts of data from their users. For this case study, we investigate the adoption of NoSQL databases by two prominent social networking sites, and why they chose these database models over traditional RDBMSs.


Facebook Messages - Choosing the scalability of HBase over the constraints of MySQL

Facebook Messages is a new application from Facebook that integrates all of a user’s emails, text messages (SMS) and instant messages (IM), as well as messages sent using Facebook’s own messaging system, between that user and another person into one conversation (Muthukkaruppan, 2010). If someone wants to reply to or forward a message, he/she can select the method of communication by which that message is to be received – Facebook, e-mail, IM or SMS (Seligstein, 2011).

Technology behind Facebook Messages

During the initial development of Messages, the engineering team realised that they would

require a large and robust storage platform in which to store all the messages that would be

generated. However, the team first needed to know what the requirements for the system

were. To get an idea of how the new Messages system was likely to be used, they monitored

the usage of their current users in relation to responding to message ‘chats’ on their own

profiles. This was from a total pool of 300 million users sending 120 billion messages a

month. After careful study, the engineering team realised that two patterns emerged:

● “A short set of temporal data that tends to be volatile”.

● “An ever-growing set of data that rarely gets accessed”. (Muthukkaruppan, 2010)

The team then proceeded to evaluate different database technologies to determine which

system would be most suitable for the new Messages service. Depending on the outcome of

this evaluation, Facebook would adopt one of the options, or possibly build their own (Hoff,

Nov 2010).

Facebook already has large clusters of MySQL servers (Cohen, Nov 2010), and was the original developer of the Apache Foundation’s Cassandra database (Lakshman, Malik, 2008).

The team tested the system on clusters of MySQL databases to determine whether or not the

new system would scale to the necessary size in order for it to work for the new service. They

dismissed MySQL as an option because, when dealing with a large amount of data, indexes take a long time to update and statistics on the data rarely, if ever, get updated (Peschka, 2010). As such, the performance of the system suffered (Muthukkaruppan, 2010).


Having tested the system on Cassandra, they realised that Cassandra’s eventual-consistency model was a difficult pattern to reconcile with Messages’ new infrastructure (Muthukkaruppan, 2010). Cassandra’s consistency model means that old versions of a post or message may still be in the system after an updated post has been written (Apache, 2010a). With the instant messaging component of Messages required to return immediate results to the user, the eventual consistency model of Cassandra could result in users reading older messages still in the system, making Cassandra unacceptable for Messages.
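The stale-read risk can be illustrated with a toy model (a deliberate simplification; Cassandra's real replication and read-repair machinery is far more involved):

```python
# Toy model of an eventually-consistent store: a write lands on one
# replica and propagates later, so a read from another replica can
# return a stale message in the meantime.
replicas = [{"msg": "old"}, {"msg": "old"}]

def write(value):
    replicas[0]["msg"] = value               # accepted by one replica only

def read(replica_index):
    return replicas[replica_index]["msg"]

def anti_entropy():
    replicas[1]["msg"] = replicas[0]["msg"]  # background sync, runs later

write("new")
print(read(1))   # "old": a stale read before the replicas converge
anti_entropy()
print(read(1))   # "new": eventually consistent
```

For an instant-messaging workload, that window of staleness is exactly what the Messages team could not accept.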

They realised that HBase was ideal for Messages. To start with, it has a simpler consistency

model than Cassandra (Muthukkaruppan, 2010). HBase is built on top of Hadoop/HDFS,

which has a replication model called replication pipelining (Peschka, 2010). Replication pipelining requires that data received by one copy of a file is immediately sent on to the second copy of the file, located on another node, and so on (Apache, 2010b). This ensures that the data is consistent across the whole system, and any user accessing the data sees the most up-to-date information, irrespective of which copy of the data is accessed.

The strength of the consistency model of HDFS/Hadoop was required if Messages was to

have the real-time updates users expect of the system.
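The pipelining idea can be sketched as follows (a simplification of the HDFS write path, not its actual API):

```python
# Simplified sketch of HDFS-style replication pipelining: the client
# writes to the first node, which forwards the block to the next node,
# and so on. The write is acknowledged only once every replica in the
# pipeline holds the data, so a reader sees the same bytes on any copy.
def pipeline_write(block, nodes):
    for node in nodes:                              # forwarded node to node
        node.append(block)
    return all(block in node for node in nodes)     # ack only when all hold it

node_a, node_b, node_c = [], [], []
acked = pipeline_write(b"message-batch-001", [node_a, node_b, node_c])
print(acked)                   # True: every replica holds the block
print(node_c[0] == node_a[0])  # True: last replica matches the first
```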

Twitter - The use of NoSQL databases at Twitter

Currently, Twitter generates 12 terabytes of data per day, doubling per annum (Weil, 2010).

Originally, in order to store the large amounts of data it was generating, Twitter used a logging system called syslog-ng. Syslog-ng eventually stopped scaling, resulting in the system dropping data. In effect, this data was lost forever (Weil, 2010).

To resolve this problem, Twitter adopted a new system called Scribe (Weil, 2010), developed and open-sourced by Facebook (Hoff, 2008). Scribe works well in a distributed system, with only minimal loss of data under certain circumstances, such as timeout issues (Hoff, 2008). Scribe solved Twitter’s initial problem of logging and saving all the data it generated. The system worked so well that Twitter now knew more about what was happening across its entire technological ecosystem (Weil, 2010).


With this increased amount of stored data, Twitter realised that it needed to be used in a productive manner. To analyse the data, Twitter turned to Hadoop and MapReduce. Hadoop was able to distribute the data over large clusters of machines, and then, using MapReduce, Twitter was able to compress the data by removing duplicates of unnecessary data, e.g. User_IDs.
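The deduplication step can be sketched in map/reduce terms (the record layout below is hypothetical, not Twitter's actual pipeline):

```python
# Minimal map/shuffle/reduce-style deduplication sketch. The record
# layout is hypothetical, not Twitter's actual pipeline.
from itertools import groupby
from operator import itemgetter

log = [
    ("user_123", "tweet_1"),
    ("user_123", "tweet_1"),   # duplicate record to be compressed away
    ("user_456", "tweet_2"),
]

# Map: emit each record; Shuffle: sort so identical records are adjacent;
# Reduce: keep one record per group.
mapped = sorted(log, key=itemgetter(0, 1))
deduped = [key for key, _ in groupby(mapped)]

print(deduped)   # [('user_123', 'tweet_1'), ('user_456', 'tweet_2')]
```

In a real MapReduce job the sort/group step is performed by the framework's shuffle phase across many machines rather than in memory.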

Attempts to analyse the data using MySQL would require joining a table of User_IDs with a table of tweets. The resulting query would have to read through both tables, involving millions of users (current count: 130 million) and all of the tweets generated by those users (billions). The time required to generate the results would be too long to make MySQL useful for such a query (Weil, 2010).

With Hadoop having been designed with parallel computing in mind, it is possible to query

all of the tweets across the Twitter infrastructure in minutes. Additionally, with Hadoop’s

emphasis on scalability, adding more machines to the cluster helps speed up the process.

However, Hadoop will also read data that is not required for the query. Attempting to solve this problem involves writing specialised queries into the Hadoop infrastructure in Java, the language in which Hadoop is written. This resulted in queries that were not optimised, slowing the system down.

To overcome this problem, Twitter uses Pig, a high-level language designed for extracting data results from Hadoop (Anand, 2008). Built on top of Hadoop, Pig can query data with 5% of the code of MapReduce, in 5% of the time (Weil, 2010). The ease of use of Pig also allows more individuals within Twitter to customise queries to their specific needs.

This has helped different departments to gain the necessary answers they require to improve

the performance and productivity of their departments.

Using HBase, Twitter started to build products within the Twitter infrastructure. Take, for example, Twitter’s People Search utility: the old system would scan through the User_ID table for the relevant name, but in an offline process on a single node. The system was prone to failure due to the length of time it took to process the query, in addition to listing irrelevant results.

Because People Search is built in HBase on top of Hadoop, Twitter is able to scale People

Search across multiple machines. This not only improves the overall performance of People

Search, it also gives it built in redundancy, as Hadoop writes multiple copies of the file across


the cluster to prevent a loss of data (Weil, 2010). Additionally, People Search is mutable, allowing users of Twitter to remain ‘findable’ by People Search even if they have changed their user names (Weil, 2010). This contrasts with the problem of structured data discussed in the section on the RDBMS.

Twitter performs a large number of queries based on degrees of separation (Weil, 2010).

Take, for example, an individual sending a tweet to another individual: every one of the sender’s followers must be informed that the tweet was sent, in addition to all of the followers of the individual who received the tweet – two sets of updates must be performed. Twitter originally used MySQL in third normal form to ensure that everyone who was meant to get the tweet update received it. Unfortunately, as the number of followers for an individual grew, MySQL ran out of RAM when the indices overflowed, resulting in updates not reaching all followers. Even when Twitter de-normalised the table into normalised lists, this became inefficient if an individual had too many followers (Weil, 2010). It also resulted in data consistency challenges, especially if a deletion occurred, as the system would then have to re-write the whole update.

To overcome these problems, Twitter built an architecture on top of MySQL called FlockDB.

FlockDB, developed by Twitter (Pointer et al, 2010), is a database that stores graph data and is optimised for fast reads and writes to adjacency lists. In conjunction with Gizzard, a middleware networking application that handles and manipulates queries between the back-end data store and the database, FlockDB stores User_IDs and Tweet_IDs as sets of integers on these lists, which are used for sorting, most recent first (Pointer et al, 2010). By reducing the amount of data the system has to read, FlockDB can query any page of indexed data very quickly, allowing tweets to be updated faster than if full User_ID or Tweet_ID records were used as the primary source of query data.
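The adjacency-list idea can be sketched as follows (a toy model in the spirit of FlockDB; the function names are hypothetical, not FlockDB's actual API):

```python
# Toy adjacency-list store in the spirit of FlockDB: edges are kept as
# plain integer IDs, most recent first, so any "page" of a follower list
# can be served without scanning full user or tweet records.
from collections import defaultdict

followers = defaultdict(list)   # user_id -> list of follower_ids

def add_follower(user_id: int, follower_id: int) -> None:
    followers[user_id].insert(0, follower_id)   # newest edge at the front

def page(user_id: int, offset: int, limit: int) -> list:
    return followers[user_id][offset:offset + limit]

for fid in (101, 102, 103):
    add_follower(7, fid)

print(page(7, 0, 2))   # [103, 102]: the two most recent followers
```

Keeping the lists as sorted integers is the design choice that makes paging cheap: the store never touches the wider user or tweet rows while fanning out an update.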


Chapter Three - Research Methodology

3.1 Introduction

“Somewhere, something incredible is waiting to be known.”

Dr. Carl Sagan (American astronomer, writer and scientist, 1934-1996)

“Knowledge is the death of research.”

(Nernst's motto) Nernst, Hermann Walther, in C. C. Gillispie (ed.), The

Dictionary of Scientific Biography (1981), Supplement, Vol. 15, 450.

“Inquiry is fatal to certainty”

Will Durant (American Writer and Historian, 1885-1981)

The quotes above perhaps illustrate both the wonder and the frustration of conducting

research. In the beginning there is an idea that before investigation may be imagined to be

wonderfully true. The following statement will serve to illustrate the point – ‘My cat must be

cleverer than all other cats because I have seen him do things that other cats cannot do.’

Science demands that such a statement, with all its assumptions, be challenged (if deemed worthy of investigation in the first place); this is what Karl Popper calls ‘falsifiability’ (cited in Burns, 2000, p. 7). Research is the way forward to test the hypotheses contained in the

statement. Proper research must test all findings on many fronts; for example, assumptions

based on culture, class, gender, language, values and history, must be challenged before facts

can be accepted as true. In this respect Nernst’s motto is only half the story. As knowledge is

acquired it must be retested under various contexts, and hence, Durant’s reverse of Nernst’s

motto is also true. The only path towards certainty is further inquiry.


This chapter attempts to show how the research question at the heart of this dissertation was

addressed and why it was placed in a particular context. It also discusses the reasoning behind

the selection of a research methodology and the specific methods used.

Chapter 1 introduced the research question, and Chapter 2 discussed it in detail by reviewing the current literature, analysing the findings of the research and finally using case studies to support our research. This chapter presents the following:

• The strategy adopted for researching the question in the form of a framework.

• Explanation of why we chose a particular research strategy.

• The methodology chosen to best fit that strategy.

• The research methods used.

• How the overall strategy fits with current research practices.

• Strengths and weaknesses of the research strategy.

3.2 The strategy adopted for researching the question

A definition of research from Robert B. Burns (not the poet) states that “Research is a

systematic investigation to find answers to a problem.” (Burns, 2000, p. 3). The Cambridge

Dictionary definition is a little closer to the work undertaken for this dissertation: “Research -

a detailed study of a subject, especially in order to discover (new) information or reach a

(new) understanding” (Cambridge, 2008).

Research theory tends to centre on two contrasting approaches – the scientific/positivist

approach and the qualitative/ interpretive approach. The starting positions of both approaches

(albeit simplified) can be summed up in the two quotes below:

"There's no such thing as qualitative data. Everything is either 1 or 0"

(Fred Kerlinger in Miles & Huberman, 1994, p. 40)


"All research ultimately has a qualitative grounding"

(Donald Campbell in Miles & Huberman, 1994, p. 40)

Ideally, all researchers would like their data to be easily validated. Numerical data, the basis of statistical analysis, should under correct control conditions produce valid answers to a hypothesis (that is, either true or false) (Burns, 2000, pp. 8-9).

On the other hand, there can be an inherent weakness in the way humans measure things. A

controlled environment in which an event or behaviour is observed is in a sense a contrived environment, and cannot present an exact replica of the conditions in which the same phenomenon occurs naturally. The Hawthorne effect (from Elton Mayo’s experiments in the

Western Electric Hawthorne Works, Illinois 1927-1932) is sometimes cited as an example of

this problem (BioMed Central, 2011).

Quantitative/Scientific/Positivist | Qualitative/Interpretivist
Objective/Rational | Subjective/Argumentative
Deductive | Inductive
Observe and Measure | Observe
Closed/Controlled | Open/Natural
Statistical | Descriptive/Interpretive
Behaviourist | Cognitive
Logical argument - ‘This is so’ | Hypotheses - ‘This seems to be so’

Table 3.1 - Key concepts in Qualitative and Quantitative research methodologies

Table 3.1 above lists some of the keywords associated with each approach. It has been compiled from various sources (Burns, 2000; Web Center for Social Research Methods; and EJBRM, 2003).


The last words in the table are perhaps pertinent to understanding the place of each approach

in the way arguments are made. In a logical argument, verified premises lead to a valid conclusion. However, without the cyclical testing and retesting of hypotheses, the validity of a premise may be compromised. Take for example the following syllogism:

All the cats I have observed have four legs.

Therefore, all cats have four legs.

The above is a weak argument. The premise is valid, but the conclusion does not take account of cats that have unfortunately lost limbs and are still alive. In a more relevant and less abstract

way the selection of information and the testing of assumptions have been key features of the

research methods for this dissertation.

The qualitative approach, especially in the initial stages of research can direct focus to where

quantitative study efforts should be made. This iterative approach of using the appropriate

methodology according to the required goal ensures better focus and therefore less waste of

research resources.

This last sentence of the paragraph above is itself a syllogism (enthymeme) (Rapp, 2010). It

contains an initial premise, followed by its inferred premise followed by its conclusion, and

hence requires further testing. It is not practical given the scope of this dissertation to stop

and test every premise of an argument to its end (even if it were possible given that the data is

not quantitative in nature). The philosopher and author Bertrand Russell recognises the

problem of syllogisms in general when saying that all that we can really expect to understand

is how words are used (Russell, 1995, p. 208). We, as researchers accept that limitation and

present our research as contained within the framework set out below. What is required is a solid foundation on which to build, enhancing the knowledge of the subject that already exists. The following sections set out the basis of the research, which has hopefully kept it on the correct track while bounded by its limitations.

3.3 A Theoretical Framework

The scope of this chapter does not merit dwelling too much on the scientific approach

because the research question and subject of this dissertation are better served by a qualitative analysis approach. The presence of the word ‘future’ in the question alone forces a


hypothetical and inductive method. At the end of this chapter the focus will return briefly to this discussion of research theory, relative to our experience of researching this dissertation. Particular attention is given to the question for future study of whether or not the availability of a large quantity of information on the Internet, together with advancements in analytical tools and skills, is bringing the two contrasting sides of research theory together.

Figure 3.1 – A Research Framework (a cycle: question for investigation; observation and investigation of instances of the phenomenon in similar contexts [literature review]; emerging patterns; form hypothesis; test hypothesis; form theories, new ideas and/or conclusions; where a hypothesis is no longer valid, the cycle returns to observation)

The diagram above presents a framework for researching the questions that arose during the

investigation of the central question – What is the future of the RDBMS in the Enterprise? It

is based on a qualitative and inductive research methodology. The framework has been

constructed from information drawn from several sources, including Robert Burns (2000), Trochim (2006), and Ellis and Levy (2008).

The main concept of the framework design is the cyclical return to the source data (in the case of this dissertation, the literature review) as hypotheses are validated,


modified or discarded. It is noted that Burns differentiates between searching for data which

supports or refutes a hypothesis and analysis of collected data to develop theories and ideas in

an open-minded way, and in which new directions for continuing research can emerge

(Burns, 2000, p. 390). For this dissertation a combined approach was taken. An initial body of information relevant to the subject was collected, and analysis of this information gave rise to theories. Where new ideas emerged, they were either developed into further research or set aside for study elsewhere. Ideas which were interesting but not pursued are recorded in chapter four.

3.4 Research Design

More specific to the purpose of this dissertation, the following research design was selected.

• Research question: What is the future of the RDBMS in the Enterprise?

• Collection and classification of relevant information

• Hypothesis – There is a future for the traditional RDBMS in the enterprise.

• Historical investigation of traditional RDBMS.

• Investigation of new types of databases and data management systems, and what they offer.

• Comparison and contrast of both traditional and new models.

• Question - Are there any drivers for change and if so what are they?

• Hypotheses of drivers for change - new applications, social behaviour, the data

volume problem.

• The marketplace for data management systems, traditional and new.

• Selected Case Studies supporting or refuting our hypotheses.

• Research on research methodology – are we doing it well enough?

• Conclusions - hypothesis true or false?


3.5 Methodology - A Qualitative Approach

The aim of the research is to provide new knowledge regarding the future role of

relational database management systems in enterprise organisations. The inquiry hypothesises

that new applications of technology (business intelligence for competitive edge, the advances

of science, information highways and more), and changes in social behaviour (social

networking and sustainable living for example) are presenting organisations with a potential

data deluge (The Economist, 2010, Vol. 394, No. 8671, p. 11). ‘Big Data’, as it has come to be known, is a problem requiring new data management systems. Over the last few years the big names in data processing and management technology (Oracle, IBM, SAP and Microsoft) have spent over €15 billion acquiring software companies specialising in data analysis and management (Cukier, 2010, p. 4). This dissertation looks at what ‘Big Data’ could mean for traditional relational data management systems.

To investigate the questions raised and the emerging hypotheses derived from the above

exploration a qualitative research approach was used.

3.6 Methods

Research methods were chosen in keeping with the framework, design and methodology

previously stated. Donald Ratcliff’s paper provides a useful compilation of qualitative research methods, from which the following were chosen. Ratcliff (2000) summarises the approach in terms similar to those of others:

• Observation of events and behaviors within a common context – examples: industry

sector, academia, sustainability, social networking

• Recognition of patterns of similar events or behaviors

• Answers induced from the findings.

• Content analysis – looking for emerging themes


3.6.1 Method - Analytic Induction

The steps of the analytic induction method are summarised by D.R. Cressey (cited by Ratcliff, 1994): (1) an idea to be explored is formed; (2) a hypothesis is developed; (3) an occurrence of an event or behaviour (a phenomenon) related to the hypothesis is sought; (4) the phenomenon either supports or refutes the hypothesis; (5) additional phenomena are sought; (6) the hypothesis is re-tested every time a new instance of the phenomenon is found.
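The cyclical logic of steps (1) to (6) can be sketched as a small program. This is only an illustrative sketch: the observations, the `supports` predicate and the `revise` step are invented placeholders, not part of Cressey’s formulation.

```python
# A minimal sketch of the analytic induction cycle described above.
# The observations and the `supports` predicate are hypothetical
# placeholders standing in for real phenomena and their evaluation.

def revise(hypothesis, obs):
    # Placeholder revision: narrow the hypothesis to exclude the counter-example.
    return f"{hypothesis} (except in cases like {obs!r})"

def analytic_induction(hypothesis, observations, supports):
    """Test `hypothesis` against each observation in turn.

    Returns (hypothesis, history), where history records the outcome of
    every test; a refuting observation triggers a revision step.
    """
    history = []
    for obs in observations:            # step 5: seek additional phenomena
        ok = supports(hypothesis, obs)  # step 6: re-test on each new instance
        history.append((obs, ok))
        if not ok:                      # step 4: phenomenon refutes hypothesis
            hypothesis = revise(hypothesis, obs)
    return hypothesis, history

# Toy run: hypothesis that every surveyed system is relational.
final, history = analytic_induction(
    "all surveyed systems are relational",
    ["Oracle", "DB2", "Cassandra"],
    lambda h, o: o in ("Oracle", "DB2"),
)
print(final)
```

The refuting instance (here, a non-relational system) forces the hypothesis to be modified rather than abandoned, mirroring the cycle in Figure 3.1.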

3.6.2 Method - Content Analysis

Content Analysis looks for emerging trends in documents, interviews, text, articles, academic

and industry papers which help the researcher to classify the information (Ratcliff, 1994). For

this dissertation, the data/information repository and categorisation features of Zotero were

used. Zotero, in its basic version, is a freely available Internet browser tool which allows

researchers to collect, categorise, organise, and share research sources. It also provides an easy method for citing sources. Zotero is produced by the Centre for History and New Media at George Mason University, which cites the United States Institute of Museum and Library Services, the Andrew W. Mellon Foundation, and the Alfred P. Sloan Foundation among its sponsors.

It was both fitting and instructive, from a learning perspective at least, that we could interact with a database in this very practical way while using the fruits of that labour to write this dissertation.

3.6.3 Method - Historical Research

Historical research involves the collection and analysis of information from past events as

an aid to exploring present and future events (Gay, 1996). Again, trends and patterns

provide the researcher with useful pointers towards further courses of exploration of

hypotheses under test. An initial hypothesis may be modified or even abandoned if findings

arising from steps 5 and 6 of the analytic induction method above suggest it. The historical

analysis stage was critical to proper research for this dissertation, as the initial question


involved looking at past events, evaluating their relevance to the hypothesis regarding the

future. It is here that assumptions were tested and where disappointment and frustration

lurked at the end of every blind alley. Exceptions to any trends or patterns discovered were

also investigated and accounted for. In other words, an open mind was required in

reaching a conclusion to the findings.

3.6.4 Method - Case Study

There is difficulty in pinning down an exact definition of a case study. According to Burns

(2000, p. 459) it has become a ‘catch-all’ term for research that does not fit into any of the other categories of research methods.

The Collins English dictionary definition is useful: “the act or an instance of analysing one

or more particular cases or case histories with a view to making generalizations”. The three

case studies in this dissertation investigate the ideas raised elsewhere in the research within

specific contexts. The exploration of themes, trends or patterns within the bounded scope of

the case study allows for a general theory about the subject under investigation to be formed.

For example, if similar events or behaviours are found to occur for different parties within the

same industry sector, can a general theory be formed, and do any exceptions alter that theory?

3.6.5 Method - Grounded Theory

Grounded Theory was originally defined by Glaser and Strauss in 1967 (in Burns, 2000, p.

433) and is perhaps the over-riding method associated with qualitative research. Burns defines it as the theory that emerges from the body of data as it is analysed, together with what was discovered previously, including the testing of speculative ideas (2000, p. 433).

It is related to the cyclical testing of hypotheses as they are formed in the context of the

research framework discussed above.


3.7 Ethics Approval

Research for this dissertation relied on available sources of data and information already in

existence. No interviews or surveys were conducted during the course of the research. Any

interviews or surveys cited in this dissertation were carried out by third parties previously and

are freely available within the public domain. Early on in the research process ethics approval

was sought from the college on the basis that interviewing experts might be required as part of the

case studies. The application was rejected on the grounds of insufficient information.

Following a decision to follow a qualitative approach to the research, involving a ‘grounded

theory’ method, it was decided that the detail sought by the college ethics approval process,

specifically the interview questions, could not be formulated until later in the research. It is felt

that this is a limitation of the ethics approval process, especially for qualitative analysis

where iterative and exploratory methods of research are used to generate ideas.

3.8 Audience

The intended audience in the first instance is the academic staff of the college and any

external readers chosen by the college, who may read this as part of the final year

undergraduate examination. Other students may find this dissertation informative in itself, or

as a starting point for further research into database management systems, the emerging data

volume problem, the case studies, and/or the research methods used. It may also be useful to

students of business who wish to expand their knowledge of database systems as well as the

current and future database market.

3.9 Significance of research

This dissertation aspires to add to the knowledge of some of the areas discussed. There are

many artefacts (articles, documents, papers, online discussions) available on the subject of

RDBMS, new and emerging databases, and ‘Big Data’. This dissertation attempts to distil the

most relevant information and extract the essential ideas about the future role of the

RDBMS within enterprises.


3.10 Limitations of the research methodology

The limitations of the research have been mentioned earlier in this chapter. These limitations

are to a large degree associated with the qualitative approach taken. Reliability and validity

testing of data is more easily applied to quantitative analysis. The sources of information for

this dissertation are textual in form and exist in various contexts. This together with the

subjective interpretations of the researchers means that key assumptions must be questioned

and generalisations must be supported. The scope and purpose of the dissertation have prompted an approach whereby enough data is collected to fairly represent the corpus of information on the subject, making a holistic examination possible.

One possible area for future study of research theory is related to our topic of ‘Big Data’.

The huge amount of data (specifically textual forms) now available to researchers via the

Internet presents a potential for new hybrid methods of research combining quantitative

analysis tools and qualitative approaches. In this respect the growing area of Business

Intelligence can provide nourishment to research intelligence.

3.11 Conclusion

The chapter opened with two quotes from Hermann Nernst and Will Durant illustrating the

symbiotic nature of knowledge and research. It could be said that new knowledge exists in

the brief period between the death of one inquiry and the rebirth of another. While

epistemology has not been the subject of this chapter, proper research should not be undertaken

without a basic understanding of the underlying theory. This chapter began with a brief

discussion of research theory. It then proceeded to put some of the theory into practice for the

specific purpose of this dissertation. A theoretical framework was illustrated and a research

design set out. A chosen methodology was explained along with its associated methods. The

limitations of the research were stated together with areas for future research.


Chapter Four - Conclusions, Limitations of Research and Future

Work.

4.1 Introduction

In the first chapter we proposed a research question. That question was examined and

defined. This formed the basis of our hypothesis which we framed in such a way as to

predicate that there was a future for the RDBMS in the Enterprise. The process of examining

and testing the validity of that predicate occurred during the literature review in chapter two.

Chapter three looked at the research methodology chosen that best suited our purpose. The

research question prompted a qualitative approach. A selection of qualitative methods was chosen as being most appropriate to the nature of the secondary research being undertaken.

We found the application Zotero useful for organising and classifying our Internet sourced

information. The tagging features in particular provide a flexible method of cross-referencing

material supporting or refuting the hypotheses under examination.

This final chapter is focused on bringing the various threads of our research to a conclusion.

It also presents possible topics for future research related to the content of this dissertation.

This future research is work which, for various reasons, we could not cover in sufficient detail to do it justice. These topics emerged as branches off the core of our research

stem. The framework of our research methodology enabled us to assess each branch as a new

idea against our central theme. The last part of the chapter acknowledges that the decision to

exclude certain topics from detailed discussion presents one limitation of our research. That

and other limitations are summarised here.


4.2 Conclusions

The following conclusions section is presented in the same sequence as the topics discussed

in the literature review of chapter two.

4.2.1 RDBMS

The discussion earlier on RDBMS presented some interesting findings. Particularly, some of

the historical aspects concerning the development of RDBMS revealed insights into why

RDBMS has a durable and sustained presence within IT. The strength of the early work on

RDBMS carried out by certain IBM research groups seems to have been bolstered by the lack of focus IBM initially gave to RDBMS. With corporate eyes on other, more commercial DBMS goals, the research teams had the time to develop their relational version

without the compromising constraints of getting a product ready for market. When IBM did

finally turn its attention towards relational DBMSs in the early 1980s, it already had the

basis of a good product in System R. One of the key disadvantages of System R at the time

was its physical size (large enough to fill a room) compared to the newly emerging mid-scale systems. SDL’s (now Oracle) DBMS offering, for example, was comparable in size to a desktop

PC today. However, it is the portability of System R’s logical design that has left a lasting

legacy.

4.2.2 New DB’s

Currently, there is a lot of development of new database technologies, with different

companies developing different systems to meet, initially, their own internal requirements.

Some of these database technologies have then been open sourced to the community, to allow

others, be they companies, organisations or individuals, to benefit from the experience of the

pioneer developers and users of those systems.

It does, however, result in a fragmented database market, with any number of compatible and complementary database systems being developed concurrently. This could result in confusion for any organisation or individual who wishes to use a database but is unsure of which model is right for them.


4.2.3 Market

Currently, the database market is dominated by a few large proprietary vendors, many of

whom have complete system offerings. They are however, being challenged by smaller

players with innovative offerings. As these challengers are gaining traction in the markets,

they are showing that big vendors have gaps in their portfolios.

One of the interesting gaps being filled is in the area of ‘Big Data’ analytics. This is an area

that is of importance for BI and the ability of companies to gain value from all the data that

they have accumulated, thus allowing them to implement new strategies or make day-to-day

decisions. Many of the disruptive technologies that new companies are developing are

targeting this area of the database market for growth. In turn, this has made them targets for

the big players looking to enter new markets or to add value to their portfolios. Usually, when the best features of one entity are combined with those of another to create a single entity, we call it a ‘hybrid’. Large vendors, through their acquisition of new technologies, are now

beginning to offer combinations of data management systems in what is often referred to as a

‘stack’ offering. RDBMS would seem to remain an important ingredient within those stacks

as its best features are put to work alongside newer technologies. Trusted vendors together

with stacks incorporating older but reliable technology provide an absorption path for less

mature open source products into the traditionally cautious enterprise DBMS market.

The jobs market for employees with the necessary skills for development of new NoSQL

databases is currently in its infancy. The numbers may be small, but given the relative youth

of the sector and the three-fold increase in demand for qualified staff with the relevant

expertise, the market for non-relational databases like Hadoop and HBase/MapReduce is

likely to increase substantially with time.

The decision making around the suitability of NoSQL or SQL based RDBMS’ for enterprises

is dependent on a number of factors. The relative immaturity of open source systems and the unstable nature of open source development can evoke caution amongst Chief Information Officers or their equivalents. However, established vendors are starting to incorporate open

source into their product offering. CA’s well established product Ingres has already moved to

an open source platform in the hope of leveraging competitive advantage by coupling open

development with CA Ingres’ reliability and brand. There are many mid and small scale data

management systems worth considering even within enterprises for achieving certain project

objectives.


4.2.4.1 Case Study 1 - Utility Companies

Utility companies, primarily energy, water and transport suppliers, are conservative in

general due to the historical constraints of state regulation. Even when that regulation is

relaxed or removed altogether, safety, cost and control measures tend to make these

enterprises more vigilant when it comes to investing in IT. However, the landscape for

utilities is changing rapidly and ‘Big Data’ looms large. For electricity providers such as the

ESB the investment in Smart Grid Networks means investing in information systems that can

manage large amounts of data in real time as well as providing for off-line analytics. For the

more risk-averse companies, the decision making may involve uncomfortable choices as

solution providers move towards open source technologies as part of their offerings. The fact

that most of the large scale and trusted vendors have Utilities solutions for managing large

data volumes may provide some cushion of comfort albeit at a higher cost.

4.2.4.2 Case study 2 - Social Networks

It is clear that new systems at social networking companies such as Facebook and

Twitter require both speed and consistency for their new services to be effective and adopted

by their users. As such, architectures based on older RDBMS models no longer provide the

necessary performance for them to be utilised on a large scale for day-to-day applications

within the Social Networking community.

Utilising open source NoSQL database architectures such as Hadoop/MapReduce allows ‘Big

Data’ users such as Facebook and Twitter to run applications on websites and crunch data

more efficiently than traditional RDBMS-model databases such as MySQL. Their successful use by high-profile adopters will provide a platform from which NoSQL

databases can attract the attention of other companies and organisations wishing to utilise the

power and functionality of the likes of Hadoop/MapReduce and Cassandra for their own

database requirements.
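The MapReduce model mentioned above can be illustrated in miniature. This is a single-process sketch of the idea only, not the Hadoop API, and the records are invented: a map phase emits key-value pairs, a shuffle groups them by key, and a reduce phase aggregates each group.

```python
# A miniature, single-process illustration of the MapReduce model
# referred to above. Real Hadoop distributes the map, shuffle and
# reduce phases across many machines; the records here are invented.
from collections import defaultdict

def map_phase(records):
    """Emit (word, 1) pairs from each record, as a word-count mapper would."""
    for record in records:
        for word in record.split():
            yield word.lower(), 1

def shuffle(pairs):
    """Group emitted values by key (the 'shuffle' between map and reduce)."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(groups):
    """Sum the values for each key, as a word-count reducer would."""
    return {key: sum(values) for key, values in groups.items()}

records = ["big data needs big systems", "data systems evolve"]
counts = reduce_phase(shuffle(map_phase(records)))
print(counts["big"], counts["data"], counts["systems"])  # 2 2 2
```

Because each phase operates independently on its keys, the same computation can be spread over many machines, which is what makes the model attractive to ‘Big Data’ users such as Facebook and Twitter.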


4.3 Future Research

4.3.1 NoSQL

NoSQL databases have only been in existence for a few years, no more than a decade at best.

Despite the dominance of RDBMS offerings in the market, NoSQL’s youth and adoption by the open source community will eventually see it compete with traditional database

models. However, there are caveats on the horizon. According to Market Research Media, the

future market for NoSQL databases will grow from ‘no value’ today to $1.8bn by 2015.

Additionally, this growth will eventually be dependent on NoSQL databases incorporating

transactional consistency into their infrastructures, as this is seen as an obstacle to mass

adoption of NoSQL (Market Research Media, 2010). This requirement for transactional

consistency opens a new area for future research: to determine whether NoSQL developers and vendors can incorporate into their designs a capability that the constraints (or lack of constraints) of current NoSQL architectures do not readily support.
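Why transactional consistency matters can be shown with a toy example. The in-memory dictionary store below is invented for illustration and does not model any particular NoSQL product: two interleaved read-modify-write sequences, with no transaction isolating them, lose one update.

```python
# A toy illustration of the 'lost update' problem that transactional
# consistency prevents. The dictionary store is invented for the
# example and does not model any particular NoSQL product.
store = {"balance": 100}

def read(key):
    return store[key]

def write(key, value):
    store[key] = value

# Two clients each intend to add 50, but their read-modify-write
# sequences interleave without any transaction isolating them.
a = read("balance")        # client A reads 100
b = read("balance")        # client B reads 100
write("balance", a + 50)   # A writes 150
write("balance", b + 50)   # B overwrites with 150: A's update is lost

print(store["balance"])    # 150, not the 200 a serialisable system would give
```

An RDBMS running the two updates as serialisable transactions would force one to wait for the other; it is this guarantee that mass-market enterprise adoption is seen to require.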

Brisk, a recently announced NoSQL database from Datastax (Datastax, 2011), is a hybrid

design of Cassandra that incorporates elements of Hadoop and Hive into its design. The

purpose of Brisk is to marry the low-latency capabilities of Cassandra with the analytical

capabilities of Hadoop/Hive (Datastax, 2011). According to Matthew Pfeil, Brisk is capable

of providing a tight feedback loop between real-time application and the analytics that follow

(Pfeil in Harris, 2011). If Brisk performs as advertised by Datastax, it has the potential to provide companies and organisations with a fast, real-time means of analysing ‘Big Data’. However, alternative systems are being developed by other

vendors, including big players such as IBM and EMC, who are unhappy with the

current performance of Hadoop/MapReduce (Harris, 2011). Additional research into the state

of the NoSQL market to reflect the current and future developments of Hadoop/Cassandra,

and to determine, if possible, a market trend indicating a preferred platform for NoSQL databases, is recommended. Although current NoSQL databases are not transactional in nature, combining their different capabilities may involve the adoption of transactional qualities at a later date.


4.3.2 Case Studies

During the case study on utility companies it was found that the Tennessee Valley Authority has adopted an open source platform for its data management. There may be other utility companies that have taken this path. Further primary research is required to ascertain whether this is the case and, if so, to evaluate the results.

4.3.3 Business Intelligence

One of the areas of growth relating to data management is Business Intelligence (BI). The

market for BI platforms has seen much consolidation recently as large vendors acquire new

technologies. Oracle, IBM, Microsoft and SAP now offer complete ‘stacks’ for BI (Sallam,

2010). How the RDBMS fits within the architecture of the various BI solutions available is one research thread which requires further time and study to follow to a conclusion.

4.3.4 Research Methodology

On the matter of the research methodology itself, there is scope for further study. As

mentioned earlier, we found that by applying appropriate tags to our Internet-sourced research material, a not insubstantial amount, we could quickly retrieve information relevant to whatever idea or research thread we happened to be following at the time. Tagging data is not

new and is related to the meta-data concept used by many information systems today. What is

interesting and possibly merits further study is how our tagging enabled us to correlate

information based on context. By carefully combining keywords into multiple tags, complex relationships could be built. As relationships are at the core of the RDBMS, perhaps Edgar Codd’s concept of related domains is worth re-examination. Blowing the dust off some of Codd’s earliest papers might be a good starting point.
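The correlation of sources by tag combinations can itself be expressed relationally. Below is a minimal sketch using Python’s built-in sqlite3 module; the source titles, tags and schema are invented for illustration, but the many-to-many item/tag structure and the two-tag query mirror the kind of cross-referencing our Zotero searches performed.

```python
# A minimal relational sketch of tag-based correlation, using Python's
# built-in sqlite3. The source titles and tags are invented examples;
# the schema mirrors the many-to-many item/tag structure described above.
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE source (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE tag    (source_id INTEGER REFERENCES source(id), name TEXT);
""")
con.executemany("INSERT INTO source VALUES (?, ?)", [
    (1, "NoSQL overview article"),
    (2, "Smart grid white paper"),
    (3, "BI market report"),
])
con.executemany("INSERT INTO tag VALUES (?, ?)", [
    (1, "nosql"), (1, "big data"),
    (2, "big data"), (2, "utilities"),
    (3, "big data"), (3, "nosql"),
])

# Find sources carrying BOTH tags: correlating material by context.
rows = con.execute("""
    SELECT s.title
    FROM source s
    JOIN tag t1 ON t1.source_id = s.id AND t1.name = 'nosql'
    JOIN tag t2 ON t2.source_id = s.id AND t2.name = 'big data'
    ORDER BY s.id
""").fetchall()
print([title for (title,) in rows])
```

Each additional tag in the conjunction is simply another join, which is one way of seeing why Codd’s related domains remain a natural fit for this kind of contextual retrieval.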

Another case for future study concerns the Internet search engines used by us (Google in

particular) for the purpose of retrieving information. Google presents a list of items in order of popularity. Care was taken not to accept blindly the first few items returned on each page.

Also, it should be noted that a particular algorithm used by a search engine may influence

patterns in the returned information that may or may not be reliable. Common sense we hope

prevailed in our assessments of key assumptions. In this matter further study is necessary if

we are to maintain confidence in information search systems, especially those operating

under competitive business models.


4.4 Limitations of the Research

The aim of this dissertation is to present the reader with a review of the available information selected by us from various sources in relation to the research question, and to propose some conclusions based on our findings. In the final section of chapter three the limitations of our chosen research methodology were discussed. This section outlines the limiting factors which impacted on the overall content of the dissertation.

Due to the relative youth of the NoSQL database industry, research material is limited and

much of it has not been robustly tested over a sufficiently large period of time to determine

NoSQL’s effectiveness and impact on the database market outside a select few segments of

the industry. There is a lack of published academic literature to support the conclusions of

database vendors that have NoSQL offerings, save published literature from the developers

and users of the databases themselves, much of which may be biased.

Given that the NoSQL industry is youthful in comparison to RDBMS, the greatest source for

research material has been via the Internet. This has caused its own problems, as the great

body of material available has made it difficult to determine which material and articles would most benefit our research, as almost anyone can have an opinion on any given

topic. As such, we have tried to restrict ourselves to original source material, as well as

sources with journalistic and research integrity, such as Forrester and Gartner.

Additionally, having taken on a research topic in what has proved to be a dynamic, youthful and fluid industry, developments can occur almost daily that influence and/or affect conclusions we have already made, requiring readjustment of our findings regularly throughout our research. This occurred most recently with the announcement by

a company called Datastax that they are releasing a hybrid of the two most relevant open

source NoSQL database offerings, called Brisk. This development occurred after we had completed the gathering of our research material. Given additional time, we would have

incorporated this development into the main body of our dissertation, instead of leaving it for

others to follow.


“The outcome of any serious research can only be to make two questions grow where

only one grew before”

Thorstein Veblen (1857 – 1929)

The quotation above is from the US economist & social philosopher Thorstein Veblen. It

sums up the notion that the limitations of research are bounded only by how many questions

there are. It is in this spirit that we pass on the baton to others to ask more questions of the

answers found.

4.5 Final thoughts

“Technology presumes there's just one right way to do things and there never is.”

Robert M. Pirsig

Having reached the end of our dissertation we do not propose any one overriding conclusion.

However, if one were forthcoming, it might be stated in the form of advice. Anybody involved

in the decision making regarding technology should bear in mind the above quotation from

Pirsig. Enterprises embarking on new information systems projects or a review of their data

management requirements should understand the strengths and weaknesses of the various

offerings. They should be careful to analyse their choices with respect to the overall business

objective. This means understanding where critical performance is required and where

sacrifices can be made. There is no ‘one size fits all’ solution. RDBMS and newer emerging non-relational DBMSs each have merit for certain objectives, and much depends on where the overhead is placed. Trust and reliability are key factors applying to both vendors and their

products. It is hoped that this dissertation has presented a case for maintaining an open mind

when it comes to choosing data management systems for the enterprise. The scope and

quality of solutions available to manage the increasing scale and complexity of information in

the enterprise is growing and by no means excludes legacy systems such as RDBMS.

The RDBMS may in future no longer hold the bulk of the enterprise’s data, but what it does, it does well. The good work carried out at the beginning will ensure that the RDBMS still has much to offer.


REFERENCES

Aclara Software MDMS (2008) Scalability White Paper. http://www.aclaratech.com/AclaraSoft/WhitePapers/Aclara%20Software%20MDMS.pdf [Accessed on 05/02/2011]

AMT Sybex (no year) SMART DTS – AMT-Sybex’ Smart Utility Solution. http://www.amt-sybex.com/media/pdf/SmartDTS2010.pdf [Accessed on 01/03/2011]

AMT Sybex (2009) ESB Market Opening Information Exchange Project (case study). http://www.amt-sybex.com/OurEurope/ESBCaseStudy.aspx [Accessed on 17/02/2011]

Anthes, Gary (2010) Happy Birthday, RDBMS. Communications of the ACM, Vol. 53, No. 5, May 2010.

Bachman, Charles (1973) Programmer as Navigator. 1973 ACM Turing Award speech. http://awards.acm.org/images/awards/140/articles/1896680.pdf [Accessed on 27/03/2011]

Anand, A. (2008) PIG – The Road to an Efficient High-Level Language for Hadoop. http://developer.yahoo.com/blogs/hadoop/posts/2008/10/pig_-_the_road_to_an_efficient_high-level_language_for_hadoop/ [Accessed on 07/03/2011]

Apache (2011) http://cassandra.apache.org [Accessed on 06/03/2011]

Apache (2010a) Architecture Overview – Cassandra Wiki. http://wiki.apache.org/cassandra/ArchitectureOverview [Accessed on 07/03/2011]

Apache (2010b) HDFS Architecture. http://hadoop.apache.org/common/docs/r0.20.2/hdfs_design.html#Replication+Pipelining [Accessed on 07/03/2011]

Arnett, Bob (2011) “Extreme IT Makeover” Transforms Georgia Utility. Electric Energy Online. http://www.electricenergyonline.com/?page=show_article&mag=54&article=404 [Accessed on 04/02/2011]

Barrass, Robert (2005) Students Must Write, 3rd ed. Routledge, UK.

BioMed Central (no year) The Hawthorne Effect: a randomised, controlled trial. http://www.biomedcentral.com/1471-2288/7/30 [Accessed on 01/03/2011]


Bayes, Pere Urbon (2010). Graph Theory and Databases. http://nosql.mypopescu.com/post/2316706732/graph-theory-and-databases [Accessed on 30/03/2011]
Bednarz, A. (2006). Utility tackles data integration, analysis. Network World, Feb 13, 2006, Volume 6(23), pg. 34. ABI/INFORM Global.
Bocchino, Robert L. Jr. (1996). Book Review – 'Software Patents' by Gregory A. Stobbs (New York, N.Y.: John Wiley & Sons, Inc., 1995, pg. 623). Harvard Journal of Law & Technology, Volume 9, Number 1, Winter 1996. http://jolt.law.harvard.edu/articles/pdf/v09/09HarvJLTech213.pdf [Accessed on 05/02/2011]
Bocij, Paul et al. (2006). Business Information Systems. Essex: Pearson Education Ltd.
Bord Gáis Strategy. http://www.bordgais.ie/corporate/aboutus [Accessed on 05/02/2011]
Broad, Keith (2011). Case Study – Bluewater Power Goes ERP Route to Address Deregulation. http://www.electricenergyonline.com/?page=show_article&mag=57&article=374 [Accessed on 10/02/2011]
Brown, J. M. (2010). Energy policy: It pays to be smart when planning for power needs. Financial Times, November 2, 2010. http://www.ft.com/cms/s/0/4e4d638c-e611-11df-9cdd-00144feabdc0.html#ixzz1DyGcqQLR [Accessed on 04/02/2011]
Bryant, Randal E. & Kwan, Thomas T. (2008). Milestone Week in Evolving History of Data-Intensive Scalable Computing. Carnegie Mellon University and Yahoo! www.cra.org/ccc/docs/bigdata_highlights.pdf [Accessed on 10/04/2010]
Business Dictionary (2011). WebFinance, Inc. http://www.businessdictionary.com/definition/database.html [Accessed on 14/03/2011]
Bylund, A. (2011). Teradata: A Study of Contrasts. http://www.fool.com/investing/general/2011/02/14/teradata-a-study-in-contrasts.aspx [Accessed on 06/03/2011]
Cambridge (Online Edition) (2011). Cambridge Advanced Learner's Dictionary. Cambridge University Press, UK.
Cambridge (3rd Edition) (2008). Cambridge Advanced Learner's Dictionary. Cambridge University Press, UK.


Carole, J. The Surprise Winners in the $34 Billion Smart Grid Market. Lux Research, Inc. http://www.businesswire.com/news/home/20110126005312/en/Surprise-Winners-34-Billion-Smart-Grid-Market [Accessed on 10/04/2011]
Cassandra (2011). http://cassandra.apache.org/ [Accessed on 28/03/2011]
Cassandra wiki (2011). Cassandra: Architecture Overview. http://wiki.apache.org/cassandra/ArchitectureOverview [Accessed on 28/03/2011]
Cattell, Rick (2011). Scalable SQL and NoSQL Data Stores. http://cattell.net/datastores/Datastores.pdf [Accessed on 29/03/2011]
Chamberlin, Donald D. (1976). Relational Data-Base Management Systems. IBM Research Laboratory, San Jose, California. Computing Surveys, Vol. 8, No. 1, March 1976. http://www.dpi.inpe.br/cursos/ser303/relational_csur.pdf [Accessed on 10/04/2011]
Chang, F., Dean, J., et al. (2006). Bigtable: A Distributed Storage System for Structured Data. http://labs.google.com/papers/bigtable.html [Accessed on 06/03/2011]
Chu, Eric; Beckmann, Jennifer; Naughton, Jeffrey (2007). The Case for a Wide-Table Approach to Manage Sparse Relational Data Sets. http://www.citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.107.36 [Accessed on 17/02/2011]
Codd, Edgar F. (1970). A Relational Model of Data for Large Shared Data Banks. Communications of the ACM, Volume 13/Number 6/June 1970. http://cacm.acm.org/magazines/1970/6/12368-a-relational-model-of-data-for-large-shared-data-banks/pdf?dl=no [Accessed on 10/04/2011]
Codd, Edgar F. (1985). "Is Your DBMS Really Relational?" (Part 1) and "Does Your DBMS Run By the Rules?" (Part 2). ComputerWorld, October 14 and October 21.
Codd, Edgar F. (1991). The Relational Model for Database Management (Version 2). Reading: Addison-Wesley.
Cohen, J. (2010). Facebook Showcasing Its Open Source Database. http://www.allfacebook.com/facebook-showcasing-its-open-source-database-2010-11 [Accessed on 07/03/2011]
Cukier, Kenneth (2010). 'Data, data everywhere – A Special Report on Managing Information'. The Economist, Vol. 394, No. 8671.


Datastax (2011). DataStax Rewires Hadoop for Low-Latency Applications with Apache Cassandra. http://www.datastax.com/news/press-releases/datastax-rewires-hadoop-for-low-latency-applications-with-apache-cassandra [Accessed on 07/04/2011]
Datawarehouse4u (2009). OLTP vs. OLAP. http://datawarehouse4u.info/OLTP-vs-OLAP.html [Accessed on 05/03/2011]
Davison, Robert M. (1998). 'Chapter 3: Research Methodology' in An Action Research Perspective of Group Support Systems: How to Improve Meetings in Hong Kong (PhD thesis). City University, Hong Kong. http://www.is.cityu.edu.hk/staff/isrobert/phd/phd.htm [Accessed on 22/02/2011]
DBMS2 (2011). DataStax OpsCenter for Apache Cassandra announced. http://www.dbms2.com/2011/02/01/datastax-opscenter-cassandra/ [Accessed on 06/03/2011]
Dimitrov, Marin (2010). NoSQL Databases. http://www.slideshare.net/marin_dimitrov/nosql-databases-3584443?from=ss_embed [Accessed on 14/03/2010]
EJBRM, e-Journal of Business Research Methods (2003). Vol. 2, Issue 2, July 2003. http://www.ejbrm.com/volume2/issue2 [Accessed on 20/02/2011]
Ellis, Timothy J. & Levy, Yair (2008). 'Framework of Problem-Based Research: A Guide for Novice Researchers on the Development of a Research-Worthy Problem'. International Journal of an Emerging Transdiscipline, Volume 11.
Elmasri, Ramez and Navathe, Shamkant B. (1989). Fundamentals of Database Systems. California: The Benjamin/Cummings Publishing Company.
Engage Consultants (2010). High-level Smart Meter Data Traffic Analysis. For: ENA, Document Ref: ENA-CR008-001-1.4, May 2010. http://www.energynetworks.org/ena_energyfutures/ENA-CR008-001-1%204%20_Data%20Traffic%20Analysis_.pdf [Accessed on 17/02/2011]
ESB Strategy Framework 2020. http://www.esb.ie/main/sustainability/strategy-to-2020.jsp [Accessed on 05/02/2011]
ESB Strategy on Sustainability. http://www.esb.ie/main/sustainability/smart-meters.jsp [Accessed on 05/02/2011]


Evans, B. (2011). Global CIO: IBM's Most Disruptive Acquisition Of 2010 Is Netezza. http://www.informationweek.com/news/global-cio/interviews/showArticle.jhtml?articleID=229201238&queryText=Jim%20Baum [Accessed on 06/03/2011]
Fehrenbacher, Katie (2010). Smart Grid 101: Utilities Are Very Risk Averse. January 24, 2010. http://gigaom.com/cleantech/smart-grid-101-utilities-are-very-risk-averse/ [Accessed on 15/02/2010]
Fehrenbacher, Katie (2009). Why Open Source for the Smart Grid Needs a Kick-Start. November 11, 2009. http://gigaom.com/cleantech/why-open-source-for-the-smart-grid-needs-a-kick-start/ [Accessed on 15/02/2011]
Feinberg, D., Beyer, M. A. (2010). Magic Quadrant for Data Warehouse Database Management Systems. http://www.businessintelligence.info/docs/estudios/Gartner-Magic-Quadrant-for-Datawarehouse-Systems-2010.pdf [Accessed on 05/03/2011]
Fink, D. (2010). A Little Gray Switch Is a Hot-Button Issue for Solar Homeowners. The Solar Home & Business Journal. http://solarhbj.com/2010/02/disconnect-switch-hot-button-issue-for-solar-owners-000100.php [Accessed on 15/02/2010]
Finkle, J. (2008). Microsoft, Oracle databases gain share vs IBM. http://www.reuters.com/article/2008/08/26/software-database-idUSN2634118720080826 [Accessed on 06/03/2011]
Gardner, D. (2010). IBM acquires Netezza as big data market continues to consolidate around appliances, middle market, new architecture. ZDNet blog, September 21. http://www.zdnet.com/blog/gardner/ibm-acquires-netezza-as-big-data-market-continues-to-consolidate-around-appliances-middle-market-new-architecture/3861?tag=content;search-results-rivers [Accessed on 06/03/2011]
Ghosh, Debasish (2010). Multiparadigm Data Storage for Enterprise Applications. IEEE Software, Vol. 27, Iss. 5, Sep/Oct 2010, p. 57. http://proquest.umi.com.elib.tcd.ie/pqdlink?index=2&did=2110653061&SrchMode=1&sid=4&Fmt=6&VInst=PROD&VType=PQD&RQT=309&VName=PQD&TS=1300019837&clientId=11502 [Accessed on 14/03/2011]
Giroti, Tony (2011). You've Got the Meter Data – Now What? Electric Energy Online. http://www.electricenergyonline.com/?page=show_article&mag=66&article=524 [Accessed on 04/02/2011]


Graham, C., Sood, B., Sommer, D., Horiuchi, H. (2010). Cited in: Market Share: RDBMS Software by Operating System, Worldwide, 2009. http://www.oracle.com/us/products/database/number-one-database-069037.html [Accessed on 05/03/2011]
Greenbaum, J. (2010). SAP buys Sybase, and History is Re-Written. http://www.enterpriseirregulars.com/17887/sap-buys-sybase-and-history-is-re-written/ [Accessed on 06/03/2011]
Greenplum (2010). About page. http://www.greenplum.com/about-us/ [Accessed on 06/03/2011]
Grimes, Seth (2011). BridgePoint Article – Unstructured Data and the 80 Percent Rule. Clarabridge. http://clarabridge.com/default.aspx?tabid=137&ModuleID=635&ArticleID=551 [Accessed on 14/03/2011]
Hammond, J. (2009). Open Source Adoption: What Your Peers Are Up To. Forrester Research, March 24, 2009. http://www.linux.com/news/enterprise/biz-enterprise/345132-forrester-analyst-says-open-source-has-won [Accessed on 04/02/2010]
Harris, Derrick, citing Pfeil, Matthew (2011). DataStax Shakes Up Hadoop with NoSQL-Based Distro. http://gigaom.com/cloud/datastax-shakes-up-hadoop-with-nosql-based-distro/ [Accessed on 07/04/2011]
Harris, Derrick (2011). Yahoo Suggests MapReduce Overhaul to Improve Hadoop Performance. http://gigaom.com/cloud/yahoo-suggests-mapreduce-overhaul-to-improve-hadoop-performance/ [Accessed on 07/04/2011]
Hayes, Frank (2002). The Story So Far. Computerworld, April 15, 2002. http://www.computerworld.com/s/article/70102/The_Story_So_Far [Accessed on 14/03/2011]
Henschen, Doug (2010). The Big Data Era: How Data Strategy Will Change. InformationWeek, August 7, 2010. http://www.informationweek.com/news/showArticle.jhtml?articleID=226600216&cid=tab_art_entsoft [Accessed on 13/03/2011]
Henschen, Doug (2011a). Gartner Ranks Data Warehousing Leaders. http://www.informationweek.com/news/software/info/managemnt/showArticle.jhtml?articleID=229215658&cid=RSSfeed_IWK_All [Accessed on 05/03/2011]


Henschen, Doug (2011b). Global CIO: IBM's Most Disruptive Acquisition Of 2010 is Netezza. http://www.informationweek.com/news/global-cio/interviews/showArticle.jhtml?articleID=229201238&pgno=2&queryText=Jim+Baum&isPrev= [Accessed on 06/03/2011]
Higginbotham, S. (2010). Digg Not Likely to Give Up on Cassandra. http://gigaom.com/2010/09/08/digg-not-likely-to-give-up-on-cassandra/ [Accessed on 06/03/2011]
Hoff, Todd (2010). Facebook's New Real-time Messaging System: HBase to Store 135+ Billion Messages a Month. HighScalability Blog, November 16. http://highscalability.com/blog/2010/11/16/facebooks-new-real-time-messaging-system-hbase-to-store-135.html [Accessed on 07/03/2011]
Hoff, Todd (2009). Neo4j graph database kicks buttox. http://highscalability.com/neo4j-graph-database-kicks-buttox [Accessed on 30/03/2011]
Hoff, T. (2008). Product: Scribe – Facebook's Scalable Logging System. http://highscalability.com/product-scribe-facebooks-scalable-logging-system [Accessed on 07/03/2011]
Hypertable (2011). http://www.hypertable.org/sponsors.html [Accessed on 30/03/2011]
Irish Government Energy White Paper. http://www.dcenr.gov.ie/NR/rdonlyres/54C78A1E-4E96-4E28-A77A-3226220DF2FC/27356/EnergyWhitePaper12March2007.pdf [Accessed on 04/02/2011]
IT Jobs Watch (2011). http://www.itjobswatch.co.uk/jobs/uk/hadoop.do [Accessed on 06/03/2011]
Kanaracus, C. (2011). EMC's Greenplum Offers 'Big Data' Tools at No Charge. http://www.pcworld.com/businesscenter/article/218355/emcs_greenplum_offers_big_data_tools_at_no_charge.html [Accessed on 06/03/2011]
Kennedy, J. (2011). Utility firms lead US$950bn charge towards sustainable IT. February 4, 2011. http://www.siliconrepublic.com/strategy/item/20251-utility-firms-lead-the/ [Accessed on 15/02/2011]
Kerner, S. M. (2010). Citing: Quinn, J. Digg Moves From MySQL to NoSQL. http://itmanagement.earthweb.com/datbus/article.php/3870116/Digg-Moves-From-MySQL-to-NoSQL.htm [Accessed on 06/03/2011]


Krill, Paul; Leon, Mark (1996). Dual force. InfoWorld, Dec 2, 1996, 18(49), pg. 37. ABI/INFORM Global. http://proquest.umi.com.elib.tcd.ie/pqdlink?index=4&did=10472333&SrchMode=1&sid=9&Fmt=6&VInst=PROD&VType=PQD&RQT=309&VName=PQD&TS=1300022938&clientId=11502 [Accessed on 14/03/2011]
Lakshman, A. (2008). Cassandra – A structured storage system on a P2P Network. http://www.facebook.com/note.php?note_id=24413138919 [Accessed on 07/03/2011]
LaValle, S., Hopkins, M., Lesser, E., Shockley, R., Krushwitz, N. (2010). Analytics: The new path to value. http://www.informationweek.com/whitepaper/download/showPDF.jhtml?id=180700006&site_id=300001&profileCreated=&k=IWKREG [Accessed on 05/03/2011]
Lawton, C. (2008). Business Technology: Open-Source Databases Make Headway; Scaling Challenges Remain as Upstarts Carve Bigger Niche. Wall Street Journal (Eastern Edition), April 8, 2008, p. B.5. ABI/INFORM Global. (Document ID: 1458385931)
Legal Dictionary. http://legaldictionary.thefreedictionary.com/Public+Utilities [Accessed on 15/02/2011]
Lindenberger, M. (2007). 'Open Source Still Earning Trust as CIOs Consider Enterprise Software Solutions' [Blog], April 24, 2007. http://www.itbusinessedge.com/cm/blogs/bentley/open-source-still-earning-trust-as-cios-consider-enterprise-software-solutions/?cs=14422 [Accessed on 04/02/2011]
Lohr, Steve (2009). In Sun, Oracle Sees a Software Gem. New York Times, April 20, 2009. http://www.nytimes.com/2009/04/21/technology/companies/21sun.html?_r=1&partner=rss&emc=rss [Accessed on 15/02/2011]
Lorica, B. (2009). Most Hadoop jobs are in California. http://radar.oreilly.com/2009/06/most-hadoop-jobs-are-in-california.html [Accessed on 06/03/2011]
Mackie, Kurt (2011). Microsoft Unveils SQL Server Fast Track Data Warehouse 3.0. http://tdwi.org/articles/2011/02/17/sql-server-fast-track-data-warehouse.aspx?admgarea=news [Accessed on 10/04/2011]


Market Research Media (2010). NoSQL Market Forecast 2011-2015, Tabular Analysis. November 2010. http://www.marketresearchmedia.com/2010/11/11/nosql-market/ [Accessed on 10/04/2011]
McGoveran, David and Date, C.J. (2010). Letter to the Editor: How to Celebrate Codd's RDBMS Vision. Communications of the ACM, Volume 53/Number 10/October 2010. http://mags.acm.org/communications/201010/?folio=7&CFID=14255234&CFTOKEN=78521835#pg9 [Accessed on 10/04/2011]
McHugh, Josh (1997). Michael Stonebraker: The ultimate database. Forbes, July 7, 1997, Vol. 160, Iss. 1, pg. 326. [Accessed on 14/03/2011]
McIssac, Kevin (2007). The data deluge: The growth of unstructured data. Computerworld, September 12, 2007. http://www.computerworld.com.au/article/195150/data_deluge_growth_unstructured_data/ [Accessed on 10/04/2011]
McJones, Paul (Ed.) et al. (1997). The 1995 SQL Reunion: People, Projects, and Politics. SRC Technical Note 1997-018, August 20, 1997. http://www.mcjones.org/System_R/SQL_Reunion_95/SRC-1997-018.pdf [Accessed on 10/04/2011]
Metz, C. (2011a). Google v Facebook salary inflation riles Big Data startup. http://www.theregister.co.uk/2011/02/04/cloudera_caught_between_facebook_and_google/ [Accessed on 06/03/2011]
Metz, C. (2011b). HBase: Shops swap MySQL for open source Google mimic. http://www.theregister.co.uk/2011/01/19/hbase_on_the_rise/ [Accessed on 06/03/2011]
Metz, C. (2010). Facebook unveils 'next-gen' messaging system. http://www.theregister.co.uk/2010/11/15/facebook_announcement/ [Accessed on 06/03/2011]
Metz, C. (2008). Cokeheads slip AI onto Yahoo! front page. http://www.theregister.co.uk/2008/07/10/yahoo_front_page_ai/ [Accessed on 06/03/2011]
Miles, M. B. and Huberman, A. M. (1994). Qualitative Data Analysis: A Sourcebook of New Methods. SAGE, USA.
Mitchell, Bradley (no date). http://compnetworking.about.com/od/speedtests/a/network_latency.htm [Accessed on 31/03/2011]


Monash, Curt (2010a). NoSQL Basics, Benefits and Best-Fit Scenarios. InformationWeek, October 10, 2010. http://www.informationweek.com/news/software/info_management/showArticle.jhtml?articleID=227701021&pgno=1&queryText=&isPrev [Accessed on 13/03/2011]
Monash, Curt (2010b). Quick reactions to SAP acquiring Sybase. http://www.dbms2.com/2010/05/12/sap-acquire-sybase/ [Accessed on 06/03/2011]
Moran, Aidan P. (2000). Managing Your Own Learning At University – A Practical Guide. UCD Press, Dublin.
Muthukkaruppan, K. (2010). The Underlying Technology of Messages. http://www.facebook.com/note.php?note_id=454991608919 [Accessed on 06/03/2011]
Neil, S. (2011). Citing: Gillespie, M. Sybase Adds Enterprise Security to Android. http://www.managingautomation.com/maonline/exclusive/read/Sybase_Adds_Enterprise_Security_to_Android_27756683 [Accessed on 06/03/2011]
Neo4j (2011). http://neo4j.org/ [Accessed on 31/03/2011]
Netezza (2011). http://www.netezza.com/data-warehouse-appliance-products/twinfin.aspx [Accessed on 06/03/2011]
Neubauer, Peter (2010). Graph Databases, NOSQL and Neo4j. http://www.infoq.com/articles/graph-nosql-neo4j [Accessed on 30/03/2011]
Oracle History (2011). Oracle. http://www.oracle.com/us/corporate/timeline/index.html [Accessed on 27/03/2011]
Oracle White Paper (2011). Smart Metering for Electric and Gas Utilities. January 2011. http://www.oracle.com/us/industries/utilities/046593.pdf [Accessed on 17/02/2011]
Oracle (2007). Oracle Buys Lodestar. http://www.oracle.com/us/corporate/Acquisitions/lodestar/oracle-lodestar-faq-072227.pdf [Accessed on 15/02/2011]
Oxford English Dictionary (8th Edition) (1990). The Oxford English Dictionary of Current English. Oxford: Clarendon Press.


Pariseau, Beth (2009). Energy IT sees smart-grid boon. SearchStorageChannel.com. http://searchstoragechannel.techtarget.com/news/1355355/Energy-IT-sees-smart-grid-boon-for-data-storage [Accessed on 15/02/2011]
Pariseau, Beth (2008). IDC: Unstructured data will become the primary task for storage. IT Knowledge Exchange, October 29, 2008. http://itknowledgeexchange.techtarget.com/storage-soup/idc-unstructured-data-will-become-the-primary-task-for-storage/ [Accessed on 14/03/2011]
Patterson, David A. et al. (1988). A Case for Redundant Arrays of Inexpensive Disks (RAID). University of California, Berkeley.
Peschka, J. (2010). HBase and Hadoop at Facebook. http://facility9.com/2010/11/18/facebook-messaging-hbase-comes-of-age [Accessed on 07/03/2011]
Pointer, R., et al. (2010). Twitter Engineering: Introducing FlockDB. http://engineering.twitter.com/2010/05/introducing-flockdb.html [Accessed on 07/03/2011]
PR Newswire Europe Limited (2011). http://www.prnewswire.co.uk/cgi/news/release?id=160381
Prickett Morgan, T. (2010). Teradata pumps data warehouses with six-core Xeons. http://www.channelregister.co.uk/2010/10/25/teradata_appliance_refresh/ [Accessed on 06/03/2011]
Prickett Morgan, T. (2011). EMC lets go of Greenplum Community Edition. http://www.theregister.co.uk/2011/02/01/emc_greenplum_community_edition/ [Accessed on 06/03/2011]
ProQuest Database, via TCD Library. http://stella.catalogue.tcd.ie/iii/encore/search/C%7CSutilities%7COrightresult%7CU1?lang=eng&suite=pearl [Accessed on 10/04/2011]
Rapp, Christof (2010). 'Aristotle's Rhetoric', in Edward N. Zalta (ed.), The Stanford Encyclopedia of Philosophy (Spring 2010 Edition). Stanford. http://plato.stanford.edu/archives/spr2010/entries/aristotle-rhetoric [Accessed on 22/02/2011]
Ratcliff, Donald (compiled by). 15 Methods of Data Analysis in Qualitative Research. http://qualitativeresearch.ratcliffs.net/15methods.pdf [Accessed on 10/04/2011]
Ratcliff, Donald E. (1994). Analytic Induction as a Qualitative Research Method of Analysis. The University of Georgia. http://qualitativeresearch.ratcliffs.net/analytic.pdf [Accessed on 10/04/2011]


Rodriguez, Marko A. (2010). Graph Databases: More than an Introduction. http://nosql.mypopescu.com/post/1173828185/graph-databases-more-than-an-introduction [Accessed on 30/03/2011]
Rosenberg, Dave (2009). Open-source Hadoop powers Tennessee smart grid. November 9, 2009. http://news.cnet.com/8301-13846_3-10393259-62.html [Accessed on 16/02/2011]
Russell, Bertrand (1995). History of Western Philosophy. Routledge, London.
Sallam, Rita L. (2010). Q&A: The Benefits and Perils of Buying Into the Megavendor Stack. Gartner Research, 30 April 2010, ID No: G00200485.
Selinger, Pat (2005). 'A Conversation with Pat Selinger'. Interview by Jim Hamilton, ACM Queue, April 18, 2005. http://delivery.acm.org/10.1145/1060000/1059803/p18-hamilton.pdf?key1=1059803&key2=7717841031&coll=DL&dl=ACM&ip=109.255.147.37&CFID=14259874&CFTOKEN=11948404
Seligstein, J. (2011). See the Messages that Matter. The Facebook Blog, February 11. http://blog.facebook.com/blog.php?post=452288242130 [Accessed on 07/03/2011]
SmartGrids: European Technology Platform (2006). 'What is Smart Grids'. http://www.smartgrids.eu/?q=node/163 [Accessed on 12/02/2011]
Souders, Steve (2009). Even Faster Web Sites. http://stevesouders.com/docs.velocity-20090622.ppt [Accessed on 31/03/2011]
St. John, J. (2011). There Will Be Nine Times the Smart Grid Data by 2020. Cleantech News and Analysis. http://gigaom.com/cleantech/there-will-be-nine-times-the-smart-grid-data-by-2020/ [Accessed on 06/02/2011]
Stedman, C. (1997). Scottish utility tries to end the vendor blame game. Computerworld, Vol. 31(11), pgs. 75, 80. ABI/INFORM Global. (Document ID: 11269142)
Stevens, Tim (2004). Overcoming High-Latency Database Access with Java Stored Procedures. http://www.informit.com/articles/article.aspx?p=170870 [Accessed on 10/04/2011]
Stonebraker, Michael, et al. (2010). MapReduce and Parallel DBMSs: Friends or Foes? Communications of the ACM, Volume 53/Number 1/January 2010.


Stonebraker, Michael, and DeWitt, David J. (2008). A Tribute to Jim Gray. Communications of the ACM, Volume 51/Number 11/November 2008. http://rptcd.catalogue.tcd.ie/ebsco-web/ehost/pdfviewer/pdfviewer?vid=3&hid=10&sid=a54188b1-6075-4543-a891-3df2451bd16d%40sessionmgr14
Stoner, James A.F., & Freeman, Edward R. (1989). Management (Fourth Edition). New Jersey: Prentice-Hall International, Inc.
Subramanian, K. (2010). Riptano, Cloudera For Cassandra. http://www.cloudave.com/450/riptana-cloudera-for-cassandra/ [Accessed on 06/03/2011]
Sybase (2011). http://www.sybase.com/products/mobileenterprise/ianywheremobileoffice [Accessed on 06/03/2011]
Teradata (2011). Customer List. http://www.teradata.com/t/customers-list/browse/ [Accessed on 06/03/2011]
Thiel, Carol Tomme (1982). Relational DBMS: What's in a Name. Infosystems, 29(9), 52. ABI/INFORM Global. (Document ID: 1356651). http://proquest.umi.com.elib.tcd.ie/pqdlink?index=4&did=1356651&SrchMode=1&sid=14&Fmt=2&VInst=PROD&VType=PQD&RQT=309&VName=PQD&TS=1300026039&clientId=11502 [Accessed on 13/03/2011]
Tiernan et al. (2006). Modern Management – Theory and Practice for Irish Students (3rd Edition). Dublin: Gill & Macmillan.
Trefis Team (2011). Oracle Exadata Software Gives Oracle Upside. http://www.trefis.com/articles/33739/oracles-exadata-software-give-oracle-20-upside/2011-01-18 [Accessed on 05/03/2011]
Trochim, William M.K. (2006). Deduction and Induction. Web Center for Social Research Methods. http://www.socialresearchmethods.net/kb/dedind.php [Accessed on 22/02/2011]
Van Tulder, Gijs (2003). Storing Hierarchical Data in a Database. SitePoint, April 30, 2003. http://articles.sitepoint.com/article/hierarchical-data-database [Accessed on 10/04/2011]
Vittal, S. (2010). The Marketing Software Convergence Continues: Teradata Acquires Aprimo For $525 Million. Forrester Blog, December 23. http://blogs.forrester.com/suresh_vittal/10-12-22-the_marketing_software_convergence_continues_teradata_acquires_aprimo_for_525_million [Accessed on 06/03/2011]


Von Finck, K. (2009). The Global Database Market. the indian wind along the telegraph lines (blog), September 3, 2009. http://blogs.gnome.org/mneptok/2009/09/03/the-global-database-market/ [Accessed on 06/03/2011]
Warren, C. (2011). Android, BlackBerry & iOS Tied for U.S. Market Share. http://mashable.com/2011/02/01/nielsen-smartphone-marketshare/ [Accessed on 06/03/2011]
Weglarz, Geoffrey (2004). Two Worlds of Data – Unstructured and Structured. Information Management Magazine, September 2004. http://www.information-management.com/issues/20040901/1009161-1.html [Accessed on 14/03/2011]
Weil, K. (2010). NoSQL at Twitter. Strange Loop Conference, December 23. http://www.infoq.com/presentations/NoSQL-at-Twitter [Accessed on 07/03/2011]
White, Tom (2010). Hadoop: The Definitive Guide (2nd Edition). Surrey: O'Reilly Media.
Woods, D. (2010). How Digg's Cassandra Debacle Could Have Been Avoided. http://www.forbes.com/2010/09/21/cassandra-mysql-software-technology-cio-network-digg.html [Accessed on 06/03/2011]
Yahoo Research (2011). PNUTS – Platform for Nimble Universal Table Storage. http://research.yahoo.com/project/212 [Accessed on 09/04/2011]
Yuhanna, Noel (2010). Sybase Acquisition By SAP – A Great Move. Forrester Blog, May 17. http://blogs.forrester.com/noel_yuhanna/10-05-17-sybase_acquisition_sap_great_move [Accessed on 06/03/2011]
Yuhanna, Noel (2009). The Forrester Wave: Enterprise Database Management Systems, Q2 2009. Forrester, June 30, 2009.
Zawodny, J. (2007). Open Source Distributed Computing: Yahoo's Hadoop Support. Jeremy Zawodny Blog, July 25. http://developer.yahoo.com/blogs/ydn/posts/2007/07/yahoo-hadoop/ [Accessed on 06/03/2011]


APPENDIX

Table A.1 – Edgar Codd's original relational model terms and their equivalent or alternative meaning.

| Term | Meaning in Codd's RDBMS model | Equivalent in commercial RDBMS | Equivalent in non-RDBMS | Example |
|------|-------------------------------|--------------------------------|-------------------------|---------|
| Domain | Both basic data type and extended data type, in order to convey meaning | Metadata – not imposed; basic data type is used | Metadata – not imposed | Basic data type (rules): Currency, €, Integer, etc. Extended data type (meaning): Financial, Location, etc. |
| Relation | A subset of the Cartesian product (set theory); a set of composite data | A table | Table, record, file... | NA |
| R-Table | A base table, usually created in the beginning as a root source | NA | NA | PARTS_TABLE, SUPPLIER_TABLE |
| Derived R-Table | A table created by combining data from two or more tables | NA | NA | PARTS_SUPPLIER_TABLE |
| Atomic value | A data value which should not be sub-divided | Same | Same | Price, €100 |
| Composite value | A data value which can be sub-divided | Same | Same | An address field, text, an audio file |
| Tuple | A row in a table containing a set of related data | Record | Record, row, column, file | Smith, John, 12 Greenview Road… 31/12/1945… etc. |
| Primary key | One column in a table assigned to hold the unique identifier value for each row (not null) | Same, but not imposed | NA | PART_ID: 123456, 345636 |
| Foreign key | One or more columns in a table assigned to hold a reference to the primary key field in another table. Used to link tables together and maintain the integrity of the information and relationships | Foreign, Secondary | As above, but existing in a derived table |

Note 1: A naming convention is also specified by Codd.

Note 2: By combining two relations we are also combining two domains (Parts and Suppliers), creating a new composite domain; let's call it "Capabilities".
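The primary key and foreign key mechanics summarised above can be sketched in a few lines of code. The following is an illustrative example only, using Python's built-in sqlite3 module; the table and column names (supplier, part, PART_ID-style identifiers) echo Codd's PARTS/SUPPLIER examples and are not taken from any system discussed in the dissertation:

```python
import sqlite3

# In-memory database for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default

# The primary key holds a unique, not-null identifier for each row.
conn.execute("""CREATE TABLE supplier (
    supplier_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL)""")

# The foreign key references the primary key of another table.
conn.execute("""CREATE TABLE part (
    part_id     INTEGER PRIMARY KEY,
    description TEXT NOT NULL,
    supplier_id INTEGER NOT NULL REFERENCES supplier(supplier_id))""")

conn.execute("INSERT INTO supplier VALUES (1, 'Acme')")
conn.execute("INSERT INTO part VALUES (123456, 'Widget', 1)")

# Referential integrity: a part cannot reference a non-existent supplier.
try:
    conn.execute("INSERT INTO part VALUES (345636, 'Gadget', 99)")
except sqlite3.IntegrityError as e:
    print("rejected:", e)

# A "derived R-Table": combining the two relations via the key link.
rows = conn.execute("""SELECT p.part_id, p.description, s.name
                       FROM part p
                       JOIN supplier s ON p.supplier_id = s.supplier_id""").fetchall()
print(rows)  # [(123456, 'Widget', 'Acme')]
```

Note that the `PRAGMA foreign_keys = ON` line is essential here: it illustrates the table's "Same, but not imposed" caveat, since some commercial systems leave key constraints optional rather than mandatory as Codd's model specifies.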