Memory Virtualization in
Database Systems
Angelos Anastasopoulos
Master of Science
School of Informatics
University of Edinburgh
August 2008
ABSTRACT
Virtualization is a technology that is rapidly transforming the IT landscape and fundamentally changing the way that people compute. The benefits of virtualization techniques are becoming increasingly appealing as high quality of service and error-free operation have become requirements for every system. A lot of work has been done in the domain of virtual servers, but database virtualization is still an open area. This work addresses an existing problem in Xcalibre's FlexiScale software: providing a transparent, online scalable database system. A solution to the problem of database virtualization can be accomplished by implementing a virtualized filesystem, i.e., providing a global namespace by unifying the storage disks of the running virtual servers.
ACKNOWLEDGEMENTS
First and foremost, I would like to thank my supervisor, Dr. S. Viglas, for his constant guidance and support.
I would also like to express my gratitude to the experienced people of Xcalibre, and especially to Mr. G. Munasinghe, for their invaluable help.
DECLARATION
I declare that this thesis was composed by myself, that the work contained herein is my own except where explicitly stated otherwise in the text, and that this work has not been submitted for any other degree or professional qualification except as specified.
(Angelos Anastasopoulos)
TABLE OF CONTENTS
1 Introduction
1.1 Background
1.1.1 Single to Many Virtualization
1.1.2 Many to Single Virtualization
1.2 Virtualization Benefits
1.3 Related Work
1.3.1 Amazon EC2
1.3.2 AppLogic
1.3.3 XCalibre FlexiScale
1.4 Proposal - Project Aims
1.5 Specifications
2 System Design
2.1 Current System
2.2 Possible Solutions
2.2.1 Modifying the Database System
2.2.2 File System in Kernel Space
2.2.3 File System in User Space
2.3 Our Approach
2.3.1 Overview
2.3.2 Our Implementation in Greater Detail
2.4 Graphical User Interface
3 Evaluation
3.1 Measuring TPS
3.1.1 TPC-B
3.1.2 Selections Only
3.1.3 Discussion
3.2 Adding and Removing a Server
3.2.1 Discussion
3.3 Clustered vs Unclustered Data
3.3.1 Discussion
4 Conclusions
4.1 Future Recommendations
References
Chapter I
Introduction
The idea of virtualization is well known in both academia and industry.
Virtualization is a technique for hiding the physical characteristics of computing
resources from the way in which other systems, applications, or end users interact
with those resources [1]. As computing becomes more and more distributed, the need
for having a pool of resources available transparently to the end user has become a de
facto requirement for every distributed system.
There are many different kinds of virtualization. The main idea is to make a single physical resource appear to function as multiple logical resources, or to make multiple physical resources appear as a single resource. In either case, the end user should not be concerned with whether the physical resource being utilized resides in his personal environment (personal computer) or somewhere on a network. Virtualization essentially lets one computer do the job of multiple computers, by sharing the resources of a single computer across multiple environments.
This study explores ways of offering an online scalable database by implementing virtualization in a database system. This work has been done in collaboration with Xcalibre [2], a leading UK hosting provider, which provided all the required resources. The solution proposed is the design and implementation of a virtualized file system which is used by the database system to store its data. The main advantage of developing a virtualized file system is its generality: virtualization does not only concern database tables but common files as well. The developed virtualized file system has the ability to interact with any application, i.e., any database system that the end user selects. However, this work has been tested with PostgreSQL [3], a well-established open source database system.
1.1 Background
There are various virtualization technologies. The idea of the virtual machine first appeared in the 1960s in the experimental IBM M44/44X system, in which the operating system used the machine to simulate multiple copies of the machine itself [4]. In [5], the different kinds of virtualization encountered in various IT environments are presented.
1.1.1 Single to Many Virtualization
This category includes virtualization technologies where a single physical resource
appears to function as multiple logical resources. According to the resource that is
virtualized (such as a server, an operating system, an application, or storage device)
there are different types of virtualization:
● Operating System Virtualization: where multiple logical (or virtual)
operating systems (aka "guests") run on top of a fully functioning base (or
"host") operating system. This method of virtualization usually uses a
standard operating system such as Windows or Linux as the host, plus a
virtual machine manager, to run multiple guest operating systems. Some
vendors and products providing this type of virtualization include
Microsoft Virtual Server, SWsoft Virtuozzo, Parallels Workstation/Desktop, BSD jails, and Sun Solaris containers.
● Server Virtualization (also known as "system virtualization" or
"native virtualization"): where multiple virtual operating systems run
directly on top of the hardware without an intervening operating system.
Typically, virtualization software will run directly on the base hardware,
and the operating systems will be installed onto that virtualization
software. So called "paravirtualization" is (arguably) a subset of server
virtualization that provides a thin interface to run between the base
hardware and a modified guest operating system. Examples of server virtualization include VMware ESX Server and Xen. Server
virtualization facilitates a rapid – or even automatic – restart of
applications after a software failure. When used in conjunction with data
replication between data centres, it can restart applications at a recovery
site following a primary site failure.
● Application Virtualization: where an application is provided to the end-
user, generally from a remote location (such as a central server), without
needing to completely install this application on the user's local system.
Unlike traditional client-server operations, each user has an isolated, fully
functional application environment, sharing few if any components with
other users. Examples of this include Citrix Presentation Server, Thinstall
Virtualization Suite, and Altiris Software Virtualization Solution.
● Desktop Virtualization: where remote access to a complete desktop
environment allows access to any authorized application, regardless of
where the application is actually located. Examples of this include
Microsoft Terminal Services, VMware Virtual Desktop, and Kidaro
Managed Workspace.
● Software Streaming: essentially a subset of other virtualization
technologies that provides a way for software components - including
applications, desktops, and even complete operating systems - to be
dynamically delivered from a central location to the end-user over the
network. A user can start using streaming software before the entire
download has completed, much like video streaming, without a complex and lengthy installation process. Examples of this technology include
AppStream, Ardence (acquired in December 2006 by Citrix), and
Microsoft SoftGrid.
● Storage Virtualization: a way for many users or applications to access
storage without being concerned about where or how that storage is
physically located or managed. Typically storage virtualization applies to
larger SAN or NAS arrays, but it is just as accurately applied to the
logical partitioning of a local desktop hard drive. Examples include a
range of hardware, software, and appliance solutions from IBM, EMC,
Network Appliance, and others.
● Data Virtualization: Data virtualization abstracts the source of individual
data items – including entire files, database contents, document metadata,
messaging information, and more – and provides a common data access
layer for different data access methods – such as SQL, XML, JDBC, File
access, MQ, JMS, etc. This common data access layer interprets calls
from any application using a single protocol, and translates the application
request to the specific protocols required to store and retrieve data from
any supported data storage method. This allows applications to access
data with a single methodology, regardless of how or where the data is
actually stored.
Fig. 1.1 Data virtualization [4]
● Software as a Service (SaaS): an implementation of virtualization where
software is provided by an external application service provider (ASP),
normally on a usage basis. Typically, the end user will access the software
service through a Web browser and, in some cases, specialized software may
still be required. The complete software application is not hosted locally, or
even within the enterprise, but is hosted at a third-party service provider.
● Thin Client: a local system that has limited or no independent processing,
storage, or peripherals of its own, relying entirely on a remote system for
virtually all operations. Typically, a thin client will have limited local
processing that allows it to merely perform I/O to a central server, which
hosts the operating system, desktop, and applications.
1.1.2 Many to Single Virtualization
This category includes virtualization technologies where many physical resources
appear as a single resource.
● Clustering: A cluster is a form of virtualization that makes several locally-
attached physical systems appear to the application and end users as a single
processing resource. A typical use case for clustering is to group a number of
identical physical servers to provide distributed processing power for high-
volume applications, or as a “Web farm”, which is a collection of Web servers
that can all handle load for a Web-based application.
● Grid Computing: Like a cluster, a grid provides a way to abstract multiple
physical servers from the application they are running. The major difference
is that the computing resources are normally spread out over a wide network,
potentially across the Internet, and the physical servers that comprise a grid
do not have to be identical. Unlike a cluster, where each server is locally
connected, is likely to be identical, and can handle the same processing
requirements, a grid is made up of heterogeneous systems, in diverse
locations, each of which may specialize in a particular processing capability.
Much greater coordination is needed to allocate the resources to appropriate
workloads.
1.2 Virtualization Benefits
The benefits of virtualization techniques are becoming increasingly appealing as high quality of service and error-free operation have become requirements for every system. Virtualization allows easier software migration, including system backup and recovery, which makes it extremely valuable as a Disaster Recovery (DR) or Business Continuity Planning (BCP) solution.
Virtualization can duplicate critical servers, so IT does not need to maintain
expensive physical duplicates of every piece of hardware for DR purposes. DR
systems can even run on dissimilar hardware. In addition, virtualization reduces
downtime for maintenance, as a virtual image can be migrated from one physical
device to another to maintain availability while maintenance is performed on the
original physical server. This applies equally to servers and desktops, or even mobile
devices – virtualization allows workers to remain productive and get back online
faster when their hardware fails.
Other benefits of virtualization include business agility and flexibility (virtualization enables IT to respond to rapid, on-demand changes of the system), server consolidation (improved server utilization by distributing the workload), reduced downtime (virtual images are easier to restore after a failure), and reduced software and hardware costs.
1.3 Related Work
A lot of work has been done in operating system and server virtualization where
numerous software products are widely used. The first successful x86 virtualization software package was released by VMware in the late 90s. VMware Workstation allows users to run multiple instances of x86- or x86-64-compatible operating systems on a single physical PC, without requiring any changes to processors or operating systems [6]. Other well known x86 virtualization products
are Parallels, Microsoft Virtual PC, QEMU+KQEMU, and VirtualBox. Most
virtualization environments enable the end user to run multiple operating systems
and multiple applications on the same computer at the same time, increasing the
utilization and flexibility of hardware [7].
1.3.1 Amazon EC2
Amazon Elastic Compute Cloud (Amazon EC2) [8] is a web service that provides
resizable compute capacity in the cloud. It is designed to make web-scale computing
easier for developers. The main idea of Amazon EC2 is to enable end users to rent only the resources that they really need at any given point in time. The "Elastic"
nature of the service allows developers to instantly scale to meet spikes in traffic or
demand. When computing requirements unexpectedly change (up or down), Amazon
EC2 can instantly respond, meaning that developers have the ability to control how
many resources are in use at any given point in time. In contrast, traditional hosting
services generally provide a fixed number of resources for a fixed amount of time,
meaning that users have a limited ability to easily respond when their usage is
rapidly changing, is unpredictable, or is known to experience large peaks at various
intervals. The end user interacts with EC2 by running his application on a virtual machine, selecting the desired memory, CPU, and instance storage that are optimal for his application.
Fig. 1.2 Amazon EC2 end user web interface [8]
1.3.2 AppLogic
AppLogic [9] is a grid operating system designed to enable utility computing for web
applications. It uses advanced virtualization technologies to ensure complete
compatibility with existing operating systems, middleware and applications. As a
result, AppLogic makes it easy to move existing web applications onto a grid without
modifications.
Fig. 1.3 The architecture of AppLogic [9]
1.3.3 XCalibre FlexiScale
Following Amazon's EC2 idea, FlexiScale [10], developed by XCalibre, is a web service which enables the end user to utilize computing power on demand. The virtual machine, called an instance in EC2, is here called a Virtual Dedicated Server (VDS) and enables the user to specify his requirements in memory and storage capacity, as well as his desired operating system, at any given point in time.
Fig. 1.4 FlexiScale’s virtual dedicated server [10]
The above web services are able to adjust their functionality to the needs of the end
user in terms of memory, CPU usage and storage capacity. Things become more
complex when the end user requires virtualization to be done on a database system.
1.4 Proposal - Project Aims
To date, no service offers an online scalable database. This study explores ways of implementing virtualization in a database system. The foremost obstacle in making a database system scalable is the mapping from the logical-virtual (user-visible) address space to the physical address space. The end user need not have any knowledge of the physical address where the tables of the database are actually stored, as the physical address can change according to the user's storage demands; this leads to transparency. In other words, the database system should be totally unaware of the underlying file system and of where its data tables are stored. The end user may boot up or kill any server (thereby changing the overall available storage capacity) at any time without the database system noticing.
With the standard virtual disks provisioned with each FlexiScale VDS, it is not possible to mount the same disks on multiple servers simultaneously, or to have multiple disks mounted on a single server. Each FlexiScale VDS can be considered an autonomous, independent machine with its own CPU, memory, operating system and storage capacity. The main contribution of this work is to overcome this limitation by enabling any service running on any of the VDSs that an end user has booted up to use any resource from any VDS at his disposal. The developed virtualized file system enables data sharing between servers. To put it differently, the end user is aware of only a single mount point and totally unaware of where data is actually stored: on which storage device, on which server.
At this point we should clarify the terms virtualization and transparency. Virtualization makes a resource visible to the end user even though this resource does not really exist. On the other hand, a transparent resource exists physically but is made invisible to the end user through an abstraction layer. This study combines virtualization with transparency. The virtual layer is responsible for giving the user a single view of the entire storage layer by providing him a common address space, a single mount point. The user's space of logical addresses is both virtual, in the sense that it does not really exist, and transparent, as each storage disk has its own memory addresses which are invisible to the user.
1.5 Specifications
The developed virtualized file system should meet the following specifications, as assigned by XCalibre:
● high transparency to the end user, i.e., the end user is not aware of the physical address where the data of any application or service (database) resides.
● on-demand scaling of the database without having to shut down the database server.
● a generic approach suitable for any application and therefore any database system.
● the system is responsible for adapting to any change in the environment without the running applications noticing.
As far as the operating system on which the virtualized file system is built is concerned, Linux meets our expectations, as it is open source. However, it should be noted that the virtualized file system could run on a virtual Linux operating system hosted by Windows using any operating system virtualization software.
Chapter II
System Design
This chapter describes how FlexiScale currently handles the on-demand scaling of the system in terms of storage capacity, and presents various possible solutions to the problem of database virtualization. Finally, our approach and system design are presented, including the graphical user interface.
2.1 Current System
For the time being, the user of FlexiScale can create a new Virtual Dedicated Server in less than a minute, change the VDS parameters according to his requirements in memory, operating system, and storage capacity on the fly and on demand, and automatically recover from a physical server failure. When the user adds a new VDS, a new autonomous server is created. The added VDS has its own storage system which is completely independent of the storage system running on any other VDS currently in use. In other words, each VDS has its own physical namespace.
Fig. 2.1 Current System: File_1, which resides in VDS 1, is different from File_1 in VDS 2
The absence of a single namespace raises limitations when the user wants to add a new server which will use data from older servers. For instance, if a user has booted up a VDS with a 40GB storage capacity (with 35GB used) and requires his storage capacity to increase to 60GB, a new VDS will be created with the desired storage capacity. This will be a replica of the initial server (leaving 25GB of free space). This means that it is not possible to just add a new VDS with 20GB storage capacity, as the new VDS would be unaware of the 35GB of data stored on the initial server. Any process running on the initial VDS must be stopped and must wait until the replication procedure has terminated. Due to the absence of a global namespace, an application running on one VDS cannot use the data stored on another.
Fig. 2.2 Current System: Adding a new VDS
2.2 Possible Solutions
Possible solutions to the problem of database virtualization include either the implementation of a virtual layer inside the database system, responsible for mapping the logical addresses used by the storage layer of the database system to the physical addresses where data resides, or the implementation of a virtual file system in kernel or in user space. In the former case the way that the database interacts with the underlying file system has to be modified, whereas in the latter the underlying database remains unchanged. This work concentrates on the latter case due to the former's loss of generality, as will be shown shortly.
2.2.1 Modifying the Database System
In this approach the virtual layer is embedded in the database system. The embedded virtual layer interacts with the storage layer of the database, which can be distributed across multiple VDSs. Therefore, the database system, besides executing queries, is responsible for the mapping from logical to physical addresses.
Fig. 2.3 Modifying the database system
The database engine knows where each table of the database is physically stored and appropriately maps the user's logical addresses. When the database system (PostgreSQL) interacts with the user, it uses logical addresses, whereas when interacting with the underlying file system it uses physical addresses.
The main drawback of this approach is that the resulting system would work solely for the particular database system for which it was implemented. Moreover, if the database is not open source, any change to the underlying database would be impossible. Therefore, modifying the database system would result in a static solution which would not meet the objectives of XCalibre.
2.2.2 File System in Kernel Space
Figure 2.4 depicts the Linux architecture in the most general terms [11]. The user-
level programs communicate with the kernel using system calls. When a user process
executes a system call, it changes its execution mode from user to kernel mode. In
kernel mode, while executing the system call, the process has access to the kernel
address space.
Fig. 2.4 Linux Architecture [11]
For the purposes of our work we will concentrate on device drivers, which are the
software interface to an I/O device. A device driver is a collection of subroutines
which are called when the kernel recognizes that a particular action should be taken
by a particular device [12]. A new file system can be implemented as a character device driver, either by recompiling the kernel and loading a new kernel image containing the implemented device, or by loading the driver as a kernel module.
Fig. 2.5 Linux system and device driver relationship [12]
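As an illustration of the kernel-module route, the sketch below registers a character device with a 2.6 kernel. It is a minimal, illustrative skeleton only: the device name hellofs and the empty file_operations table are our own placeholders, and a real driver-based filesystem would implement the callbacks and cooperate with the VFS.

/* hellofs.c - minimal loadable module registering a character device */
#include <linux/module.h>
#include <linux/init.h>
#include <linux/fs.h>

static int major;

static struct file_operations hellofs_fops = {
    .owner = THIS_MODULE,
    /* .open, .read, .write, ... would hold the filesystem logic */
};

static int __init hellofs_init(void)
{
    /* passing 0 asks the kernel to allocate a free major number */
    major = register_chrdev(0, "hellofs", &hellofs_fops);
    if (major < 0)
        return major;
    printk(KERN_INFO "hellofs: registered, major %d\n", major);
    return 0;
}

static void __exit hellofs_exit(void)
{
    unregister_chrdev(major, "hellofs");
}

module_init(hellofs_init);
module_exit(hellofs_exit);
MODULE_LICENSE("GPL");

Such a module is loaded with insmod and removed with rmmod, without recompiling the kernel.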
2.2.3 File System in User Space
Some years ago, before the advent of user space filesystems, filesystem development was the job of the kernel developer. Creating filesystems required knowledge of kernel programming and of kernel technologies (like the VFS). Filesystem in Userspace (FUSE) is a loadable kernel module for Unix-like operating systems that allows non-privileged users to create their own file systems without editing kernel code. This is achieved by running the file system code in user space, while the FUSE module only provides a "bridge" to the actual kernel interfaces. FUSE was officially merged into the mainline Linux kernel tree in kernel version 2.6.14.
Released under the terms of the GNU General Public License and the GNU Lesser
General Public License, FUSE is free software. The FUSE system was originally part of AVFS (A Virtual Filesystem), but has since split off into its own project on SourceForge.net. FUSE is available for Linux, FreeBSD, NetBSD (as PUFFS), OpenSolaris and Mac OS X [13].
FUSE is particularly useful for writing virtual file systems. Unlike traditional
filesystems, which essentially save data to and retrieve data from disk, virtual
filesystems do not actually store data themselves. They act as a view or translation of
an existing filesystem or storage device. In principle, any resource available to a
FUSE implementation can be exported as a file system.
With FUSE [14] it is possible to implement a fully functional filesystem as a
userspace program with the following features:
● Simple library API
● Simple installation (no need to patch or recompile the kernel)
● Secure implementation
● Userspace-kernel interface is very efficient
● Usable by non privileged users
● Runs on Linux kernels 2.4.X and 2.6.X
● Has proven very stable over time
Figure 2.6 shows the path of a filesystem call (e.g. stat). The FUSE kernel module
and the FUSE library communicate via a special file descriptor which is obtained by
opening /dev/fuse. This file can be opened multiple times, and the obtained file
descriptor is passed to the mount syscall, to match up the descriptor with the
mounted filesystem.
Fig. 2.6 FUSE flow-chart diagram [14]
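To make the programming model concrete, the following is a minimal read-only filesystem written against the FUSE 2.x high-level API. It is a sketch in the spirit of the standard FUSE "hello" example, not part of our system; it exposes a single in-memory file.

#define FUSE_USE_VERSION 26
#include <fuse.h>
#include <string.h>
#include <errno.h>
#include <sys/stat.h>

static const char *hello_path = "/hello";
static const char *hello_str  = "Hello, virtual world!\n";

/* Report attributes for the root directory and the single file. */
static int hello_getattr(const char *path, struct stat *st)
{
    memset(st, 0, sizeof(*st));
    if (strcmp(path, "/") == 0) {
        st->st_mode = S_IFDIR | 0755;
        st->st_nlink = 2;
        return 0;
    }
    if (strcmp(path, hello_path) == 0) {
        st->st_mode = S_IFREG | 0444;
        st->st_nlink = 1;
        st->st_size = strlen(hello_str);
        return 0;
    }
    return -ENOENT;
}

/* The root directory contains exactly one entry. */
static int hello_readdir(const char *path, void *buf, fuse_fill_dir_t filler,
                         off_t offset, struct fuse_file_info *fi)
{
    if (strcmp(path, "/") != 0)
        return -ENOENT;
    filler(buf, ".", NULL, 0);
    filler(buf, "..", NULL, 0);
    filler(buf, hello_path + 1, NULL, 0);
    return 0;
}

/* Serve reads from the in-memory string. */
static int hello_read(const char *path, char *buf, size_t size,
                      off_t offset, struct fuse_file_info *fi)
{
    size_t len = strlen(hello_str);
    if (strcmp(path, hello_path) != 0)
        return -ENOENT;
    if ((size_t) offset >= len)
        return 0;
    if (offset + size > len)
        size = len - offset;
    memcpy(buf, hello_str + offset, size);
    return size;
}

static struct fuse_operations hello_oper = {
    .getattr = hello_getattr,
    .readdir = hello_readdir,
    .read    = hello_read,
};

int main(int argc, char *argv[])
{
    /* fuse_main mounts the filesystem and dispatches kernel
       requests to the callbacks above, entirely in user space. */
    return fuse_main(argc, argv, &hello_oper, NULL);
}

Compiled and linked against libfuse (with the flags reported by pkg-config fuse), the resulting binary mounts the filesystem at the directory given on its command line; every stat, readdir, or read on that mount point travels through /dev/fuse to these callbacks.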
2.3 Our Approach
2.3.1 Overview
The key idea of our approach is the implementation of file virtualization, which enables multiple physical storage devices, each with its own physical address namespace, to be treated as a single virtual storage device with a global namespace. Each application, no matter on which VDS it runs, is able to use (read, modify, delete) any data created by any application on any VDS, regardless of where that data is actually stored. Each VDS has a single mount point which enables its applications to access data located on storage devices of other VDSs.
Fig. 2.7 Global namespace
In Figure 2.7 the user has booted up two VDSs with a 20 GB storage capacity each. An application running on VDS 1 requests to read File_1. The application is totally unaware of where File_1 is actually stored. The application passes its request to the virtual filesystem, where the mapping from virtual to physical address is done. The virtual filesystem returns to the application the requested file, which is actually stored on a different VDS (VDS 2) than the one on which the application runs. Moreover, it should be noted that every application running on any VDS considers the overall capacity of the system to be 40 GB (i.e., the sum of the storage capacities of the running VDSs).
As shown in Figure 2.8, files can be moved from one server to another without affecting file paths, which makes it easy to load-balance the underlying storage devices. The virtual address of File_1 is fixed while its physical address changes, and no change has to be made in the application using this file. Our approach succeeds in making the various VDSs which are available to the user appear to utilize a single file system.
Fig. 2.8 Moving files from one VDS to another
This approach treats the database system as an ordinary application which stores its data in the virtual filesystem. The virtual filesystem handles every file in the same way, whether it contains database tuples or other data.
Fig. 2.9 PostgreSQL running and data distribution
The implemented file system utilizes GlusterFS [15], free software released under the GNU GPL v3 license, which is a FUSE project. GlusterFS is a clustered file system which aggregates various storage bricks (storage nodes) over Infiniband RDMA or TCP/IP interconnect into one large parallel network file system. GlusterFS uses translators, which are binary shared objects (.so) loaded at run time. The idea of translators is borrowed from the GNU/Hurd [16] operating system. A translator is a program that is inserted between the actual content of a file and the user accessing this file, and it processes the incoming requests in many different ways. From the kernel's point of view, translators are just another user process (run in user space).
Figure 2.10 depicts a system with two storage nodes running on the same machine
(localhost).
Fig. 2.10 GlusterFS: server and client volume specification
files for two bricks running on localhost [15]
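For concreteness, volume specification files of the kind shown in Figure 2.10 have roughly the following shape. This is a sketch following GlusterFS 1.3-era syntax: the directory paths and IP addresses match the ones used in our setup, but the volume names and option spellings should be checked against the GlusterFS documentation rather than read as our exact configuration.

# server.vol - exports one brick plus the namespace directory
volume brick
  type storage/posix
  option directory /root/gf_exports/export
end-volume

volume brick-ns
  type storage/posix
  option directory /root/gf_exports/export-namespace
end-volume

volume server
  type protocol/server
  option transport-type tcp/server
  subvolumes brick brick-ns
  option auth.ip.brick.allow *
  option auth.ip.brick-ns.allow *
end-volume

# client.vol - connects to two bricks and unifies them under
# a single global namespace
volume client1
  type protocol/client
  option transport-type tcp/client
  option remote-host 92.60.121.114
  option remote-subvolume brick
end-volume

volume client2
  type protocol/client
  option transport-type tcp/client
  option remote-host 92.60.121.115
  option remote-subvolume brick
end-volume

volume client-ns
  type protocol/client
  option transport-type tcp/client
  option remote-host 92.60.121.114
  option remote-subvolume brick-ns
end-volume

volume unify
  type cluster/unify
  option namespace client-ns
  option scheduler rr
  subvolumes client1 client2
end-volume

A client then mounts the unified volume with the glusterfs client program, e.g. glusterfs -f client.vol ~/mount.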
2.3.2 Our Implementation in Greater Detail
This section describes our implementation in greater detail and concentrates on how the implemented file system handles the files created by PostgreSQL. However, as we have already remarked, any application is handled in the same manner. PostgreSQL runs as a service on a single VDS, referred to as the main VDS. It should be noted that all VDSs are equivalent and therefore any VDS could play the
role of the main VDS. Each VDS is a brick in GlusterFS terminology and communicates with the other bricks over TCP/IP. The IP address of the main VDS is 92.60.121.114. XCalibre supplied us with five VDSs in total (92.60.121.114-.118).
A brick is a server brick when it accepts and stores files created by processes running on other bricks, the clients. Each VDS brick acts at the same time as both a server and a client brick to the other VDSs, allowing files created by any VDS to reside on any available VDS. As will be shown shortly, this property is significant when a server is removed.
As we have already said, the virtual file system is treated by each VDS as a single filesystem with a global namespace. Therefore, two different VDSs could simultaneously create a file with the same name. Such a thing would be disastrous for any filesystem. In order to avoid this situation, the main VDS, besides running PostgreSQL, is responsible for keeping track of every file name that is already in use. When a VDS creates a new file, it must acquire the lock (semaphores are used) of the directory in which it wants to store the file, and only then carry out the creation operation. The global namespace is located in the ~/gf_exports/export-namespace directory.
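This locking discipline can be sketched in C with POSIX named semaphores. The sketch is illustrative only: the semaphore name and the create_locked helper are our own inventions, and in the real system the main VDS coordinates the lock across servers rather than a single machine.

#include <semaphore.h>
#include <fcntl.h>
#include <unistd.h>

/* Acquire the per-directory lock, create the file only if the
   name is not already taken, then release the lock. */
int create_locked(const char *dir_sem_name, const char *path)
{
    sem_t *lock = sem_open(dir_sem_name, O_CREAT, 0644, 1);
    if (lock == SEM_FAILED)
        return -1;
    sem_wait(lock);                 /* enter the critical section */
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0644);
    if (fd >= 0)
        close(fd);                  /* name reserved; creation succeeded */
    sem_post(lock);                 /* leave the critical section */
    sem_close(lock);
    return fd >= 0 ? 0 : -1;
}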
There is a single mount point (~/mount) available on each VDS which can be considered a gateway to the virtual filesystem, as it connects the running VDSs. Listing the files of the mount directory returns all files that are handled by the virtual filesystem, and not just the files that reside on the specific VDS. In other words, an ls command on the mount directory returns the same result no matter on which VDS it is run. The information about which files reside on a given VDS is located in ~/gf_exports/export. To be more precise, all files are actually stored in an export directory. The mount directory is just an image of all exported directories of the running VDSs (we can consider them as soft links into the mount directory). In other words, if the user unmounts the mount point and then performs an ls in the directory, zero files will be returned. When it is re-mounted, all data will again be visible in the mount directory.
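Concretely, the behaviour just described looks as follows from a shell on any VDS (a sketch; glusterfs -f is the standard way to mount a client volume file, and the paths are the ones given above):

glusterfs -f client.vol ~/mount    # join the virtual filesystem

ls ~/mount                 # every file in the global namespace;
                           # identical output on every VDS
ls ~/gf_exports/export     # only the files physically stored here

umount ~/mount
ls ~/mount                 # empty until the volume is re-mounted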
Fig. 2.11 Configuration of filesystem
In the default configuration of GlusterFS, each brick acts either as a client or as a server to other bricks. This naïve approach would result in a system in which only one VDS (the main one) would be able to store files on the other VDSs. Consequently, the running VDSs would not be equivalent, as only the main VDS could run processes that store data on other VDSs. In other words, all mount directories located on server bricks (except for the main VDS) would be read-only directories. Such a system would still be able to provide the required virtualized filesystem to the database system, as the database runs on a single VDS. On the one hand, the main benefit of this approach is its simplicity and the clarity of how the files are distributed across the VDSs: listing the mount directory returns only files located on this particular brick and not in the whole filesystem. On the other hand, the main drawback appears when a VDS has to be removed, i.e., when the database shrinks. Any files stored on the removed VDS must be transferred to the remaining VDSs. In the naïve approach there is no way to send files to the other VDSs, as the mount directory is read-only. Therefore, files would have to be sent over a new connection outside the virtualized filesystem (for instance, through ftp or ssh). In our implementation each brick is both a client and a server at the same time. When a server is selected for removal, all its files are re-distributed through the virtualized filesystem, resulting in better performance.
When a new VDS is added to the system there are two options available: i) do not redistribute any existing files and consider the newly added VDS only for subsequently created files, or ii) re-distribute all files. In the former case, the recently added VDS will not host any old files. Such a system is useful if the user would like to have an extra VDS for a small amount of time, perhaps for debugging purposes. For instance, the user may want to test the usage of a new index structure in PostgreSQL. He/she could add a new VDS to host PostgreSQL's files, run the evaluation tests, and then remove the VDS without affecting the older system. However, in our implementation the latter option has been adopted: all existing files are re-distributed, which results in load-balanced VDSs.
2.4 Graphical User Interface
Figure 2.12 is a screenshot of the Control Panel from which the end user can manage the storage servers. The end user has complete control over the VDSs at his disposal and can manage them with a single click. He/she can start the virtual file system by specifying the number of VDSs to include and the scheduler, which sets how created files will be distributed to the underlying servers; the choice of scheduler maps to a single option in the client volume specification, as sketched after the list below. The available scheduler options are:
● round-robin (rr): creates files in a round-robin fashion. Each VDS has its own round-robin loop. This scheduler is a good choice when files are mostly similar in size, which results in load-balanced VDSs.
● random: files are stored randomly on any VDS.
● nufa: the Non-Uniform Filesystem scheduler gives the local system priority for file creation over the other VDSs. If there is enough available space, files are stored locally (on the creating VDS). Otherwise, the remaining VDSs are used in round-robin fashion.
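In terms of the volume specification files sketched in Section 2.3.1, the scheduler chosen in the control panel amounts to one option of the unify volume (again GlusterFS 1.3-era syntax; the nufa option name in particular should be treated as an assumption):

volume unify
  type cluster/unify
  option namespace client-ns
  option scheduler nufa                    # or: rr, random
  option nufa.local-volume-name client1    # nufa: prefer the local brick
  subvolumes client1 client2
end-volume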
Fig. 2.12 GUI-Control Panel
The end user can add or remove any VDS from the virtual filesystem at any time, on the fly. The applications are not disrupted while the re-distribution of files takes place. However, during the time period that the re-configuration takes to complete, it is possible for a file required by a running application to be unavailable. Moreover, the end user can get details about the running servers and locate where each file is actually stored, i.e., on which VDS. Finally, the end user can stop and restart any server.
Chapter III
Evaluation
This chapter presents the evaluation of the implemented filesystem when deployed in FlexiScale. We concentrate on the overhead that the virtualized filesystem imposes on the system when PostgreSQL is running. Moreover, we measure the time that it takes to add and remove a VDS in the developed system. Finally, we examine whether the performance of the database is affected by whether the files created by the database are clustered on a single VDS or not. Table I describes the system used for these experiments. PostgreSQL runs as a service on the main VDS (92.60.121.114) and all VDSs have the same characteristics.
TABLE I
System Characteristics

CPU               Dual-Core AMD Opteron 8220
Memory            512 MB
Disk space        20 GB
Operating System  Ubuntu 8.04 LTS
Database          PostgreSQL 8.3.3
3.1 Measuring TPS
In order to evaluate the performance of the developed filesystem, we measured how many transactions per second (TPS) are performed by PostgreSQL when it runs on various numbers of storage servers (VDSs) executing various query scenarios. We used the pgbench benchmark [17] to generate the source data. PostgreSQL is shipped with pgbench, its standard measurement tool. pgbench runs the same sequence of SQL commands over and over, possibly in multiple concurrent database sessions, and then calculates the average transaction rate (transactions per second). In our experiments we ran a scenario loosely based on TPC-B, involving five SELECT, UPDATE, and INSERT commands per transaction, and a simpler scenario in which SELECT and INSERT commands are issued. Table II shows the tables of the database and Table III the script run by every transaction. In the select scenario the UPDATE commands are not executed. Before each experiment, a TRUNCATE operation takes place in order to free any unused pages from the buffer pool. The overall number of transactions was set to 100,000, and different experiments were performed with various scaling factors and numbers of clients. When the scaling factor is set to 1, 10, or 100, the total number of tuples residing in the accounts table is 100,000, 1,000,000, and 10,000,000 respectively.
TABLE II
Tables Used

Table accounts
  aid       integer
  bid       integer
  abalance  integer
  filler    character(84)
  Indexes: "accounts_pkey" PRIMARY KEY, btree (aid)

Table branches
  bid       integer
  bbalance  integer
  filler    character(88)
  Indexes: "branches_pkey" PRIMARY KEY, btree (bid)

Table tellers
  tid       integer
  bid       integer
  tbalance  integer
  filler    character(84)
  Indexes: "tellers_pkey" PRIMARY KEY, btree (tid)

Table history
  tid       integer
  bid       integer
  aid       integer
  delta     integer
  mtime     timestamp
  filler    character(22)
TABLE III
Transaction Script
BEGIN;
UPDATE accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM accounts WHERE aid = :aid;
UPDATE tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :delta, CURRENT_TIMESTAMP);
END;
3.1.1 TPC-B
Figure 3.1 shows the results obtained when we ran the TPC-B experiment on different numbers of servers. The blue column represents the results obtained when the experiment was run on localhost without our filesystem deployed. As expected, the blue columns have the greatest values in all cases, since distributing files through the implemented filesystem imposes overhead: the files requested by PostgreSQL no longer reside solely on localhost. The other columns show the TPS when up to five storage servers (VDSs) are used. In the following charts, S denotes the selected scaling factor and C the number of concurrent clients for each experiment.
Fig. 3.1 TPC-B results (TPS with one to five servers, for s=1 c=1, S=10 c=10, and S=100 c=100)
We observe that deploying the virtualized filesystem decreases the system's throughput by 23% when the scale factor is set to 1 with a single client. Moreover, we notice that the performance of the system is not affected by the insertion of new VDSs (the variation of TPS with 2, 3, 4, and 5 servers is less than 5%). When the scaling factor is set to 10 (1,000,000 tuples in the accounts table) with 10 clients, TPS decreases by 24%, whereas with a scaling factor of 100 (10,000,000 tuples) with 100 clients the decrease is 25%. We observe that as the number of clients increases and the database becomes larger, the overhead imposed on the system can be considered constant.
3.1.2 Selections Only
Figure 3.2 shows the results obtained when we ran the selections-only experiments on different numbers of servers. In these experiments lines 4 and 5 of the script shown in Table III are not executed. The transaction script is simpler compared to the TPC-B experiments, resulting in larger TPS values. When the scaling factor is set to 1 with
one client, the penalty in performance is less than 11%. As the experiment scenarios grow larger, the overhead we witness increases (a 55% TPS decrease for a scaling factor of 10, and 75% for a scaling factor of 100).
Fig. 3.2 Selections Only results (TPS with one to five servers, for s=1 c=1, S=10 c=10, and S=100 c=100)
3.1.3 Discussion
The previous experiments showed the impact that deploying our filesystem, with various numbers of servers and different workloads, has on PostgreSQL. All experiments showed that the greatest performance penalty occurs when the virtualized filesystem is first inserted into the system, i.e., when adding one extra server to localhost. PostgreSQL's performance remains steady no matter how many servers are used. Moreover, it is observed that the overhead imposed is constant in TPC-B-like transactions (almost 22%) no matter how many clients are concurrently executing their queries. On the other hand, in simpler queries like the Selections Only experiments, the performance penalty increases with the number of clients.
Fig. 3.3 Percentage decrease of TPS when deploying our filesystem to PostgreSQL (overhead of inserting one extra server to localhost, TPC-B and Selections Only)
The results shown in Figure 3.3 are due to the different workloads posed to PostgreSQL. The TPC-B experiments are more complex and make many changes to the files used by the database system (they perform UPDATE and INSERT commands), whereas the Selections Only experiments are dominated by read operations. Therefore, PostgreSQL becomes the main bottleneck as scenarios grow more complex, and the main overhead is imposed by PostgreSQL rather than by the implemented file system.
3.2 Adding and Removing a Server
As we have already seen, the end user can easily add or remove a server through the
developed control panel. The following experiments measure the response time of
the system in such cases. In order to evaluate the performance of the system when the
end user adds a new server, we measured the time that it takes for the new server to
be added and the time that it takes to redistribute the data. The time that it takes to
start up a new server is independent of the disk usage of the filesystem. That is,
starting up a new server takes constant time no matter how many files are hosted by
the filesystem. On the other hand, the redistribution of data is strongly affected by the
storage usage.
In order to measure the impact that the overall size of the hosted files has on the redistribution of data, different workloads were used. We created a 4MB file which we replicated 256, 512, and 768 times in order to achieve an overall storage usage of 1, 2, and 3 GB respectively. Initially, all files reside on the main VDS. The user adds a new VDS, redistributes the files (half of the files then reside on the main VDS and half on the newly added one), and finally removes the newly added VDS, at which point all files return to the main VDS. Figure 3.4 depicts the obtained results.
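Generating such a workload is straightforward from the shell; the sketch below is illustrative (the seed location and file names are our own):

dd if=/dev/urandom of=/tmp/seed bs=1M count=4    # one 4MB file

for i in $(seq 1 256); do        # 256 copies = 1 GB of data;
    cp /tmp/seed ~/mount/f_$i    # use 512 / 768 for 2 GB / 3 GB
done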
Fig. 3.4 Delay when adding a server (time in seconds for 1, 2, and 3 GB of data, broken down into starting up the server, moving files, and distributing them)
Starting up a new server takes less than 8 seconds regardless of disk usage. When the user selects to re-distribute the data to the running servers, two operations take place: i) moving all files to a new directory (which resides in the filesystem) and ii) sending the appropriate files to the other servers. Due to the round-robin scheduler used in these experiments, half of the files are sent to the newly added server. The time that it takes to move the files is determined by the characteristics of the system, i.e., CPU speed and available memory, whereas the time that it takes to send the right files to the right server depends on the available network speed.
Fig. 3.5 Delay when removing a server (time in seconds for 1, 2, and 3 GB of data, broken down into distributing files and shutting down the server)
Figure 3.5 shows the results obtained when removing a server. During the redistribution of files, only the files that reside on the removed server actually have to be moved and transferred to the remaining server. Since we have only two servers under the round-robin scheduler, half of the files have to be moved. From Figures 3.4 and 3.5 we can derive a rough estimate of the available bandwidth between the servers, which is 40 Mbps. It should be noted that sending a 1GB file through scp from one server to the other takes 2 minutes and 20 seconds: roughly 8,192 Mb in 140 s, i.e., about 60 Mbps. Typical Ethernet speeds are 10/100/1000 Mbps.
3.2.1 Discussion
As we saw, the time that it takes to add a new server is the sum of the time needed to start up the server (less than 8 seconds) and the time that it takes to redistribute the files. However, we should note that if the end user does not require load-balanced servers, i.e., when a new server is added only subsequently created files will be stored on it, then the overall time that it takes to add a new server is just the start-up time.
As far as XCalibre's current system is concerned, when the end user wants to increase his overall storage capacity, the old server is replicated to a new server with the desired capacity. Therefore, in the current configuration of FlexiScale, the end user has only one server with the overall desired storage capacity. Making a clone of the old server and starting up the new server is accomplished through a NetApp storage system and takes less than 5 seconds. Cloning a LUN (Logical Unit Number, a unique identifier used to distinguish several devices that share the same bus) does not mean that all data is moved to the new LUN. The new LUN will only have a reference point to the master LUN. Each time a file is changed in the new LUN, the master data will be copied to the new LUN. At the NetApp level there is no synchronization, and therefore the master will end up with a data set separate from the clone. In other words, the movement of the files does not occur at once but gradually, as files are updated. Making a new clone is handled by changing the reference point to the master LUN. Our approach does not involve any hardware. It is a pure software solution which can be applied to any computer network.
At this point, we should clarify that adding a new server results in an increase of the overall storage capacity. As we have already stated, there is a single database server running on the VDS server. Running a new instance of the database system on another server belongs to the domain of distributed database systems and is outside the scope of this work. Having more than one database server running simultaneously and using the same data can only be accomplished through a distributed database architecture in which the database instances communicate with each other. It is not possible to implement a distributed database system by only sharing data, without altering the database system.
3.3 Clustered vs Unclustered Data
In this section we examine whether there is any difference in the performance of the database system when files are clustered on the same server. We created four instances of the same database. The database schema is the same as shown in Section 3.1. The scaling factor was set to 10, resulting in 1,000,000 tuples in the accounts table, and the number of concurrent clients was 10, each executing the transaction script of Table III 10,000 times.
In the first experiment the various files of the database instances were distributed across the system (unclustered), whereas in the second experiment all files of each instance were clustered on one server. That is, all files of the first instance reside on the main VDS, all files of the second instance reside on the second server, and so forth.
Fig. 3.6 TPC-B results for clustered and unclustered data (TPS per database instance)
Figure 3.6 shows the results obtained when we ran the TPC-B experiment on clustered and unclustered data. We can see that there is a significant difference in the performance of the database system only for the first database instance (database 1). This behaviour is explained by the fact that the database server runs on the first server, where the files of the first database reside.
Fig. 3.7 Selections Only results for clustered and unclustered data (TPS per database instance)
Figure 3.7 shows the results obtained when we ran the Selections Only experiment on clustered and unclustered data. As expected, there is a significant difference in the performance of the database system only for the first database instance (database 1).
3.3.1 Discussion
The previous experiments showed that there is no apparent change in the performance of the database system according to whether or not the database files are clustered on a single VDS. That is, the placement of the files does not impose any additional overhead on the database system. The database system still remains the bottleneck, and there is no significant change in overall performance according to how the files are distributed across the various servers. However, these results verified the results obtained from the experiments of Section 3.1 and confirmed the expected: "Having the database files in the same server where the database is running is always faster than having the files distributed".
Chapter IV
Conclusions
Virtualization is a technology that is rapidly transforming the IT landscape and fundamentally changing the way that people compute. In essence, virtualization is a software layer which enables sharing of hardware resources. The benefits of virtualization are becoming increasingly appealing as high quality of service and error-free operation have become requirements for every system. A lot of work has been done in the domain of virtual servers, but database virtualization is still an open area. The way that a database system utilizes the underlying file system makes things more complex.
This work tried to solve an existing problem in Xcalibre's FlexiScale software: providing a transparent, online scalable database system. The developed virtualized system treats the database system as an ordinary application running on a VDS, and it succeeded in meeting the main requirements of XCalibre:
● high transparency to the end user, i.e., the end user is not aware of the physical address where the data of any application or service (database) resides. Any application running on any VDS can use files which reside on different VDSs.
● on-demand scaling of the database without having to shut down the database server. There is no disruption of the running applications while a server is added or removed by the system.
● generic approach suitable for any application and therefore any database
system.
● friendly graphical user interface which enables the user to control the various
servers.
The results obtained from the evaluation of the developed system are promising and show that a solution to the problem of database virtualization can be accomplished by implementing a virtualized filesystem, i.e., providing a global namespace by unifying the storage disks of the running VDSs.
4.1 Future Recommendations
The granularity of the developed system is the file. That is, it uses and treats files as indivisible entities. It is not able to process the individual pages which form a file. However, unlike most applications, database systems can also retrieve a specific page of a file rather than the entire file. The development of a FUSE project able to process pages as well as whole files would be of great importance to the database world. FUSE is a powerful tool that enables each user to implement his own filesystem according to his needs. A database system would no longer need to bypass the underlying filesystem; it could instead use a filesystem implemented to suit its requirements in order to achieve better performance.
References
1. Brad, A. (2007), Power Support in a Virtualization Environment [White paper], retrieved from MGE Office Protection Systems website: www.mgeops.com/index.php/content/download/1589/19407/file/MGEOPS_VirtualizationFINAL.pdf.
2. XCalibre Home Page, (2007), in XCalibre, retrieved 4 August 2008 from http://www.xcalibre.co.uk.
3. Stonebraker, M. and Kemnitz, G. (1991), "The POSTGRES next generation database management system", Communications of the ACM, 34(10), pp. 78-92.
4. Creasy, R. J. (1981), "The origin of the VM/370 time-sharing system", IBM Journal of Research & Development, 25(5), pp. 483-490.
5. Mann, A. (2007), Virtualization 101 [White paper], retrieved from Enterprise Management Associates (EMA) website: http://www.emausa.com/ema_lead.php?ls=virtwpws0806&bs=virtwp0806.
6. VMware, (2008), in Wikipedia, the free encyclopedia, retrieved 4 August 2008, from http://en.wikipedia.org/wiki/VMware.
7. Virtualization Basics, (2008), in vmware, retrieved 4 August 2008, from http://www.vmware.com/virtualization.
8. Amazon Elastic Compute Cloud (Amazon EC2) – Beta, (2008), in amazon web services, retrieved 4 August 2008, from http://www.amazon.com/gp/browse.html?node=201590011.
9. AppLogic - Grid Operating System for Web Applications, (2008), in Applogic, retrieved 4 August 2008 from http://www.3tera.com/applogic.html.
10. How FlexiScale works, (2008), in FlexiScale, retrieved 4 August 2008 from http://www.flexiscale.com.
11. Bach, M. J., The Design of the Unix Operating System, New Jersey, Prentice Hall, 1986.
12. Coffey, T. and O'Shaughnessy, A., Write a Linux Hardware Device Driver, retrieved from Network Computing website: http://www.networkcomputing.com/unixworld/tutorial/010/010.txt.html.
13. Filesystem in Userspace, (2008), in Wikipedia, the free encyclopedia, retrieved 4 August 2008, from http://en.wikipedia.org/wiki/Filesystem_in_Userspace.
14. FUSE Home Page, (2008), in FUSE, retrieved 6 August 2008, from http://fuse.sourceforge.net.
15. Gluster Storage System, (2008), in Gluster, retrieved 6 August 2008, from http://www.gluster.org.
16. The GNU/Hurd User's Guide, (2008), in HURD, retrieved 6 August 2008, http://hurd.gnu.org.
17. PostgreSQL 8.4devel Documentation, (2008), in PostgreSQL, retrieved 6 August 2008, from http://developer.postgresql.org/pgdocs/postgres/pgbench.html.