
A study of an in-memory database system for real-time analytics on semi-structured data streams

by

Alan Wen Jun Lu

A thesis submitted in conformity with the requirements for the degree of Master of Applied Science

Graduate Department of Electrical and Computer Engineering

University of Toronto

© Copyright 2015 by Alan Wen Jun Lu


Abstract

A study of an in-memory database system for real-time analytics on semi-structured data streams

Alan Wen Jun Lu

Master of Applied Science

Graduate Department of Electrical and Computer Engineering

University of Toronto

2015

Recently there have been increasing demands for real-time analytics on rapidly changing data and for databases that effectively support mixed OLTP and OLAP workloads. In-memory databases provide a promising paradigm for these applications. However, due to the rapid emergence of various types of semi-structured data, one key challenge for in-memory databases is the data layout. In this dissertation, we developed an in-memory database system that dynamically partitions its tables vertically based on workload characteristics to achieve fast query speeds on semi-structured data. We produced a set of guidelines for vertically partitioning in-memory data in different situations and showed that our approach can outperform traditional columnar and row-based storage methods, as well as an alternative data structure, ARGO, that was recently developed to enable JSON storage in relational databases. We also showed that our system has an advantage in partitioning adaptability over Hyrise, a system based on a similar idea.



Acknowledgements

I would like to express my special appreciation and thanks to my adviser, Professor Cristiana Amza. Her knowledge, guidance, and vision were very important to me throughout my graduate studies. It was an honor and a privilege to be supervised by Professor Amza.

I would also like to thank my examination committee members, Professor Ben Liang, Professor Nick Koudas, and Professor Baochun Li, for their time, comments, and valuable feedback.

I would also like to thank my colleagues and lab mates, Dr. Ali Hashemi, Dr. Jin Chen, Sahel Sharifymoghaddam, and Mihai Burcea, for their tremendous support and for sharing their knowledge. I would like to especially thank Sahel Sharifymoghaddam for her close collaboration, support, and suggestions along the research journey.

I would also like to thank Cindy QianLi Yao for her support, encouragement, and understanding.

Last but not least, I would like to thank my family, especially my parents Li Xiong Huang and Dan Boan, for their love and support.


Contents

1 Introduction

2 Background
2.1 The Need for Real-Time Analytics
2.2 Reasons for In-Memory Databases
2.3 Memory Hierarchies
2.4 Column-Based, Row-Based, and Vertically Partitioned In-Memory Data Layout
2.5 Workload and Data Characteristics
2.5.1 NoBench: A JSON Micro-Benchmark

3 Design and Implementation
3.1 Overview
3.2 In-Memory Data Structure
3.3 Query Execution
3.3.1 Table Creation
3.3.2 Data Insert
3.3.3 Select
3.3.4 Inner Join
3.4 ARGO: Alternative Data Structure for JSON in Relational Format Used for Comparison
3.4.1 ARGO Concepts
3.4.2 Baseline Comparison with ARGO
3.5 Partitioner
3.5.1 Initial Partitioning
3.5.2 Cost Model
3.5.3 Parameter Alpha: Fine-Grain vs. Coarse-Grain Partitioning

4 Vertical Partitioning Characterization
4.1 Case Study 1: Co-Location of Attributes Accessed Together
4.1.1 Methodology
4.1.2 Results
4.1.3 Summary
4.2 Case Study 2: Where Attribute Alone vs. Together with Select Attributes
4.2.1 Methodology
4.2.2 Results
4.2.3 Summary
4.3 Case Study 3: Sparse Attributes Separated and Fragmented vs. Together with Non-Sparse Attributes
4.3.1 Methodology
4.3.2 Results
4.3.3 Summary
4.4 Guidelines in Partitioning

5 Dynamic Vertical Partitioning in Multi-Type Workloads
5.1 Evaluation Methodology
5.2 Results
5.2.1 Sensitivity to Alpha Value in the Partitioning Algorithm
5.2.2 Detailed Study of One Experiment Out of 150
5.2.3 Evaluation of Dynamic Adaptation

6 Related Work
6.1 In-memory Relational Databases
6.2 In-memory Relational Databases with Vertical Partitioning

7 Conclusion and Future Work

Bibliography


Chapter 1

Introduction

In recent years, there have been increasing demands for real-time analytics that provide up-to-the-minute reporting on various types of dynamic and rapidly changing data. One area of real-time analytics that has emerged is trend analysis at major technology companies. For example, Facebook needed a system that would allow real-time analysis of events occurring on its site so that any issues can be diagnosed and addressed in a timely manner; otherwise, millions of its users would be affected, which is ultimately detrimental to Facebook's business. Recently, Facebook developed its own in-memory database system, called Scuba, for real-time performance monitoring, trend analysis, and pattern mining [1]. According to Facebook's publication on Scuba, one example of how this system is used inside Facebook is as follows: an employee could use Scuba to monitor various performance metrics of the website, including CPU load on servers, cache hits and misses, and network throughput, and use these to see if there are any major performance changes, especially after big code changes. The employee can then drill down on specific columns of the data to pinpoint issues in case of major performance abnormalities. The queries used by the employee could run over data that are only several seconds old. While it is impossible to analyze all of Facebook's data in real time, applications like Facebook's Scuba system can be used with sampled data that fit into the total memory of a cluster of servers.

This type of performance requirement, originally encountered by Internet companies such as Amazon, Google, Facebook, and Twitter, is becoming a challenge for many other companies that now also want to provide meaningful real-time services [20]. For example, trading companies need to react to sudden changes in market conditions in a timely manner [20].

We also expect that, in the future, there will be increasing needs from third-party companies and applications for data management systems that can analyze data coming from these large companies in real time. For example, many web services, such as Twitter, now provide streaming APIs that supply third parties with a constant flow of data that can be stored and analyzed in real time [18]. Systems that can store these data and offer very fast query responses can be very valuable to such third-party applications.

Other than real-time analytics, there is also a need for modern database systems to effectively support mixed OLTP (Online Transactional Processing) and OLAP (Online Analytical Processing) workloads. For example, many enterprise applications need an Available To Promise (ATP) workload to determine whether a purchase order can be fulfilled, running OLTP-style queries to update stock levels while processing OLAP-style queries to determine if there are enough stocks or resources to fulfill the order [11]. In the past, database systems were divided into OLTP and OLAP systems, and companies that ran applications in both areas would have two separate systems for these two kinds of workloads. In such arrangements, transactional processing is performed on a dedicated OLTP database system, while a separate data warehouse is installed for analytical query processing. Periodically, data from the OLTP database system are extracted, transformed into a format suitable for the data warehouse system, and loaded into the data warehouse for future usage [9]. The problem with this setup, however, is data staleness: data analytics applications often do not run on the most updated information available [17], presenting a bottleneck in real-time data analysis that is hard to overcome [6].
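The ATP pattern described above can be sketched as a mix of writes and aggregates over one in-memory table. The sketch below is purely illustrative; the function and table names are hypothetical and do not come from any real ATP system.

```python
# Hypothetical sketch of an Available-To-Promise (ATP) check mixing
# OLTP-style writes with an OLAP-style aggregate on one in-memory table.
# All names here are illustrative, not from any real enterprise system.

stock_movements = []  # append-only log of (product, quantity) deltas

def record_movement(product, qty):
    """OLTP-style write: log a stock increase (+) or reservation (-)."""
    stock_movements.append((product, qty))

def available_to_promise(product):
    """OLAP-style read: aggregate all movements for one product."""
    return sum(q for p, q in stock_movements if p == product)

def try_fulfill(product, qty):
    """ATP check: promise the order only if enough stock is available."""
    if available_to_promise(product) >= qty:
        record_movement(product, -qty)  # reserve the stock (OLTP write)
        return True
    return False

record_movement("widget", 100)
print(try_fulfill("widget", 30))   # True: 100 units on hand
print(try_fulfill("widget", 90))   # False: only 70 units remain
```

The point of the sketch is that the aggregate scan and the record-level update operate on the same live data, which is exactly what the OLTP/OLAP split with periodic ETL cannot provide.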

In-memory database systems provide a promising paradigm for the above-mentioned applications. In recent years, in-memory database systems have gained a lot of popularity, mainly because 1) access to main memory is much faster than access to disk [16], and 2) main memory in computers has been getting faster and cheaper, with increasing size, over the past decade [20]. For example, its price has been dropping by a factor of 10 every five years and its capacity has been doubling every three years [20]. Therefore, in-memory databases have gained a lot of interest, especially since faster processing enables analysis of a large volume of data in real time [20].

As its name suggests, the main idea of an in-memory database system is to store the primary dataset in the main memory of the computer, instead of on hard disk as was done traditionally, in the hope of taking advantage of faster access to memory than to disk. At first glance, it might seem counter-intuitive that an entire dataset could fit into the memory of one or more machines, because we are in the era of data explosion and, after all, the memory size of modern machines is still significantly smaller than that of disks. However, past studies have shown the feasibility of in-memory databases in real-life situations. One study showed that the vast majority of jobs (96%) in the Hive warehouses at Facebook only required input data that could fit into the cluster's total main memory [4]. As in the Scuba example, coupled with techniques such as sampling and aging of old data, many databases can fit into memory and still provide important insights. In addition, in-memory database systems have been shown to be effective in serving mixed OLTP (Online Transactional Processing)/OLAP (Online Analytical Processing) workloads, given the correct arrangement of data within the system [11].

The goal of our research project is to develop an in-memory database system that intelligently and dynamically partitions its tables vertically based on known workload characteristics to achieve fast querying speed for real-time analytics on semi-structured data streams.

Our system is among a number of recently developed in-memory database systems for real-time data analytics [9, 1, 13, 4, 7, 10]. The novelty in our design compared to these other systems is our focus on data layout in memory. Most existing systems take either a pure column-based or a pure row-based approach when storing data in tables. A columnar layout is usually used for analytical workloads that access and aggregate a large volume of records on a small set of dimensions, while a row-based layout is more suitable for transactional workloads that access many attributes of some particular records [6]. A pure columnar or row-based approach misses the opportunity of a hybrid strategy to fully exploit the potential of an in-memory system for workloads with mixed characteristics. Specifically, in reality, certain attributes are frequently accessed together by certain workloads, and storing these attributes close together in main memory can in principle lead to better cache locality and better query performance.

To the best of our knowledge, Hyrise [6] is the only existing system based on a similar idea to ours. Like our system, Hyrise partitions tables vertically depending on which attributes are accessed together, relying on a detailed cache model. However, Hyrise has two shortcomings that our system addresses. First, because Hyrise depends on detailed cache modelling, it has significant overhead when the number of attributes is large, which is usually the case for a semi-structured dataset. Second, Hyrise does not dynamically adapt its data layout to changes in workload and data characteristics, whereas our system takes these needs into account and addresses them.

We decided to target our efforts on semi-structured data because, in recent years, semi-structured data such as JSON (JavaScript Object Notation) have exploded in popularity: many popular web applications, big media, scientific applications, and content management systems today use document-store NoSQL systems to support JSON data, and many web service APIs use JSON as a data interchange format, including Facebook, Twitter, and many Google services [3]. While document-store NoSQL systems that support the storage of JSON data exist, they have drawbacks, such as low querying capability [3] and a lack of data analytics operations [20]. The ability to store JSON data efficiently in a relational database is important in building a database system for real-time analytics on today's web applications. ARGO, a mapping layer for JSON data, was developed to efficiently store JSON data in relational databases.
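The general idea behind such a mapping layer can be illustrated by flattening each JSON object into one relational row per attribute, keyed by an object id and a (possibly dotted) attribute path. This is only a rough sketch of the concept, not ARGO's exact schema; the function name and dotted-path convention are assumptions for illustration.

```python
import json

def flatten(objid, obj, prefix=""):
    """Flatten one JSON object into (objid, key, value) rows,
    using dotted paths for nested objects (rough sketch only)."""
    rows = []
    for key, val in obj.items():
        path = prefix + key
        if isinstance(val, dict):
            # Recurse into nested objects, extending the key path.
            rows.extend(flatten(objid, val, path + "."))
        else:
            rows.append((objid, path, val))
    return rows

doc = json.loads('{"name": "alice", "stats": {"followers": 42}}')
print(flatten(1, doc))
# [(1, 'name', 'alice'), (1, 'stats.followers', 42)]
```

Because every attribute becomes its own row, objects with entirely different attribute sets can share one fixed relational schema, which is what makes a scheme like this attractive for heterogeneous JSON streams.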

In this dissertation, we first implemented a simple version of an in-memory database system that stores data primarily in the main memory of a computer and has a mechanism for co-storing certain attributes together through vertical partitioning. We then investigated the in-memory database to come up with guidelines for how to vertically partition in-memory data for different query and data characteristics. Traditionally, data are arranged in either a column-based format, where data belonging to the same attribute are stored together sequentially, or a row-based format, where data belonging to the same row are stored together sequentially. In this dissertation, we call vertical partitioning the act of grouping subsets of attributes, or columns, into separate tables to maximize performance for a certain workload and dataset.
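The three layouts just described can be shown side by side on the same records. The sketch below is a minimal illustration in plain Python lists; the attribute names and the choice of which attributes to group are hypothetical.

```python
# Minimal illustration of row-based, column-based, and vertically
# partitioned layouts for the same records (names are hypothetical).
records = [
    {"id": 1, "name": "a", "score": 10, "bio": "..."},
    {"id": 2, "name": "b", "score": 20, "bio": "..."},
]

# Row-based: values of the same record stored contiguously.
row_store = [[r["id"], r["name"], r["score"], r["bio"]] for r in records]

# Column-based: values of the same attribute stored contiguously.
col_store = {a: [r[a] for r in records] for a in ("id", "name", "score", "bio")}

# Vertically partitioned: attributes frequently accessed together
# ("id" and "score", say) grouped into one partition, the rest in another.
partitions = {
    ("id", "score"): [[r["id"], r["score"]] for r in records],
    ("name", "bio"): [[r["name"], r["bio"]] for r in records],
}

print(col_store["score"])           # [10, 20]
print(partitions[("id", "score")])  # [[1, 10], [2, 20]]
```

A query touching only "id" and "score" scans the first partition contiguously, as in a column store, while still reading both attributes of each record together, as in a row store.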

We showed that the layout in which data are stored in main memory affects the database's performance significantly across different query and data characteristics. We also showed that a hybrid strategy that vertically partitions the data by grouping certain attributes together can achieve substantial performance improvements over traditional row-based and column-based storage formats.

The main idea behind the increase in performance is to intelligently partition the different attributes of a dataset based on the workload characteristics and a set of guidelines, picking a partitioning layout that leads to the best overall performance for the workload, which includes multiple queries, each with different access patterns and characteristics. Our main assumption is that most real-life applications have query and data characteristics that are somewhat stable within a certain period of time, and that there are some optimal partitioning layouts of the data that lead to the best performance. Over time, the query and data characteristics might change, and the system should change its data storage layout accordingly. However, this adaptive aspect of the system is not the main focus of this dissertation.

We performed a set of investigations to gain critical insights into the best practice for choosing the optimal layout under different conditions. We then leveraged the partitioning guidelines gained through our studies to develop a partitioning algorithm. The partitioning guidelines are important when developing the partitioning algorithm because they not only tell us which partitioning practices lead to good performance, but also help us limit the search space, which could be very large when the dataset has many attributes.

It is important to note that the partitioning algorithm is not a direct contribution of this thesis: it was developed by another researcher in our group, Sahel Sharifymoghaddam, who has been working closely with the author of this thesis.

We evaluated the partitioning algorithm on our system with synthetic workloads generated by modifying NoBench, an existing micro-benchmark for JSON data. We showed that our partitioning algorithm, along with the hybrid storage approach, yielded significant performance improvements over traditional row-based and column-based formats, as well as over ARGO, an alternative data structure developed to enable JSON data storage in relational databases. We also attempted to compare Hyrise's partitioning algorithm with our own on a JSON dataset, but its partitioning algorithm failed to complete while our system finished in seconds.

The rest of this document is organized as follows: Chapter 2 provides background information relevant to this project. Chapter 3 describes our overall design and some implementation details. Chapter 4 details our investigation and characterization of the hybrid partitioning approach. Chapter 5 describes our partitioning algorithm and its performance under a synthetic multi-query workload. Chapter 6 looks at related work in this field, and finally Chapter 7 concludes our findings and outlines our upcoming work.


Chapter 2

Background

In the past decade, there has been a lot of research in the area of in-memory databases. In this chapter, we introduce the major in-memory database concepts that are most relevant to our research, including the advantages of the paradigm in the context of the memory hierarchy of modern computers, the challenges and opportunities in laying out data in main memory, and the characteristics of the workloads we are interested in.

2.1 The Need for Real-Time Analytics

Traditionally, database systems were mainly used for transactional processing, such as sales orders or banking transactions, where only a small portion of the database was accessed in each transaction [9, 15]. Systems designed to handle these types of applications are called OLTP (Online Transactional Processing) systems. In such systems, data tuples are generally arranged in rows, data are stored on disk and cached in main memory to increase performance [15, 6], and indexing techniques are used to allow fast access to requested tuples [15]. Traditionally, OLTP systems were optimized with the goal of minimizing the number of I/O accesses. In addition, there have recently been efforts toward developing in-memory transaction processing systems, such as H-Store, that can achieve much better performance than traditional disk-based OLTP systems [8].


On the other hand, over the past two decades, another type of database system has emerged to target Business Intelligence (BI) applications [9]. Such systems are called OLAP (Online Analytical Processing) systems. These systems were designed to efficiently scan and process data spanning only a few columns (attributes) but across many rows of the entire dataset (e.g., to compute aggregate values and perform other statistical analyses for some variables) [6]. Real-life usage examples include aggregating sales statistics grouped by geographical region, customer segment, product type, etc. Recently, the use of columnar stores has become increasingly popular in such systems, allowing faster access and better compression [15].

In the past, database systems were specialized for two categories of workloads: OLTP and OLAP. Companies that ran applications in both areas had two separate systems for these two kinds of workloads. In such an arrangement, transactional processing is performed on a dedicated OLTP database system, while a separate data warehouse is installed for BI query processing. Periodically, data from the OLTP database system are extracted, transformed into a format suitable for the data warehouse system, and loaded into the data warehouse for future usage [9]. The problem with this setup, however, is data staleness, where data analytics applications often do not run on the most up-to-date information available [17].

While these technologies have improved over the years and perform very well for the applications they were designed for, the separation between transactional and analytical processing systems presents a bottleneck in response times that is hard to overcome [6]. Many researchers, such as H. Plattner et al. [16], have envisioned a world where users of business applications can interact with their software in a natural and real-time way, with an experience similar to the way current Internet users interact with web search engines. In the modern web search experience, users interact with the engine by typing in what they want to search for, and keep refining their queries after getting near real-time responses until the desired results are obtained. Being able to get the most up-to-date and relevant search results instantly is critical in this process. Business applications share the same goal in this regard. For example, a manager might be looking for real-time insights into the company's sales to make better decisions on business actions. Also consider a situation in which employees at a major social networking site need to monitor the performance of the site by interactively searching over different parameters, to spot any issues and inform engineers within the company to address them right away. In his book [16], Plattner envisioned that in future business meetings, attendees located at different places around the world will interactively browse, query, and get answers on the most up-to-date information about their business, without needing to wait a long time for new reports to be prepared.

While the analogy with the web search engine experience might seem intuitive, analytics in business databases is actually very different. In the case of search engines, only the most important search results are needed and shown, whereas business analytics requires scanning and aggregating large amounts of data, all of which are of interest. In a nutshell, a database system suitable for the job must have the most up-to-date data at the user's disposal, be able to answer ad hoc queries at any time instead of relying on predefined reporting queries, and provide sub-second responses by doing computation on the fly [16].

As mentioned, traditional databases are not suitable for real-time analysis due to the separation of transactional and analytical processing systems in most enterprises. Traditional OLTP systems have been optimized for insertion, update, and deletion of a small subset of records in the database. However, as the volume of data explodes, these arrangements make OLAP very slow, since a large number of, or all of, the records in the database need to be scanned. Because of this, OLAP systems have been developed with special data structures that provide better read performance for analytics applications [19]. The separation of the two systems requires periodic loading of data from the transactional database to the analytical system through Extract, Transform, and Load (ETL) processes, which introduces a delay in the data that the analytical system has access to, making it impossible to provide the most up-to-date results in real time to its users.

Again, a separate OLAP system is needed because traditional RDBMSs cannot answer queries fast enough due to slow read performance from disk. Understanding these constraints, researchers began to realize that in-memory databases, which can achieve much better performance than disk-based systems since memory access is much faster than disk reads, are the answer to boosting databases' performance on analytics queries, and can therefore remove the need for a separate analytics system. In the next section, we introduce in-memory databases in greater detail.

2.2 Reasons for In-Memory Databases

In-memory databases, as the name suggests, are essentially any database management systems where the primary data are stored in main memory instead of on disk. The motivation for keeping a database in main memory is simple: main memory access is four orders of magnitude faster than access to disk [16]. While the concept of an in-memory database is not new, it is only in this decade that Dynamic Random Access Memory (DRAM) has become inexpensive enough, and large enough, in modern machines for the concept to be realistic in real-life scenarios where large amounts of data need to be stored and analyzed.

The combination of faster speed, lower cost, and larger memory sizes in machines has led to greater interest in in-memory databases since the mid-2000s. Some examples include Oracle's TimesTen system, which is essentially a relational system loaded into memory, and SAP's HANA database, which offers columnar or row-based storage [12]. In-house systems like Scuba, developed by Facebook, and PowerDrill, by Google, have in-memory databases at their core, coupled with different features for their respective usage.

Figure 2.1: Price per MB comparison between main memory, flash drives, and disk drives [16]

It should be noted that one concern many people have with in-memory systems is the volatility of main memory: data could be lost when power is lost. This issue can be overcome with NVDIMMs (Non-Volatile Dual In-line Memory Modules) [12]. Also, many in-memory database systems periodically age data and store them on disk for persistence [1, 16].

While in-memory databases have the potential to offer faster analytics and remove the need for a separate OLAP system, fully exploiting their potential comes with challenges. The next two sections first visit some concepts about memory hierarchies in modern computers, then the opportunities in data layout design.

2.3 Memory Hierarchies

While main memory is built from DRAM, and one might expect that accessing data from main memory would take constant time regardless of memory location, this is not entirely true because of cache effects, which we describe next. Depending on how the data for a calculation are organized in main memory, accessing the same amount of data needed by the CPU may incur a different number of cache misses, leading to different total execution times for the same calculation.

In modern computers, when data is loaded from main memory into the CPU, it is also cached in the different cache memories that sit closer to the CPU than main memory. Figure 2.2 shows a common organization of a multi-core CPU, its L1 to L3 caches, and main memory. In a nutshell, when a block of data residing in main memory needs to be loaded into the CPU, it is loaded into the cache and passes through the different cache levels before it reaches the CPU.

Figure 2.2: Organization of modern multi-core CPU [16]

During this process, data are not loaded into the cache byte by byte. Instead, each cache level is filled on a per-cache-line basis: cache memories are divided into units called cache lines, and when data is loaded from memory into the cache, an entire cache line is filled, even if the CPU only needs a piece of data smaller than the cache line.


Cache levels that are further away from the CPU are larger, but slower to access. For example, reading a value from the L1 cache can take 80 times fewer CPU cycles than reading it from main memory [16]. From a performance perspective, it is therefore desirable for data that are accessed together to be stored together in memory, so that a single cache miss loads multiple needed values into the cache and avoids additional expensive trips to main memory. This concept is often referred to as spatial locality. Moreover, CPU manufacturers implement prefetching algorithms so that co-located data are loaded right after the first load; if the prefetch completes before the program needs the data, the second load causes no additional stalls. It is therefore wise to co-locate data that are accessed together.

From the programmer's perspective, one can achieve much better performance if data that are accessed together are stored close together in memory and read sequentially.
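The impact of spatial locality is easy to demonstrate with a small experiment of our own (this sketch is an illustration, not code from the system; the array size is arbitrary and timings are machine-dependent). It sums the same 2D array twice: once row by row, touching consecutive memory addresses, and once column by column, jumping between distant locations on every access.

```java
public class LocalityDemo {
    // Row-by-row traversal: consecutive addresses, good spatial locality.
    static long sumRowMajor(long[][] a) {
        long s = 0;
        for (int i = 0; i < a.length; i++)
            for (int j = 0; j < a[i].length; j++)
                s += a[i][j];
        return s;
    }

    // Column-by-column traversal: each read lands in a different row array,
    // so for large rows nearly every access misses the cache.
    static long sumColMajor(long[][] a) {
        long s = 0;
        for (int j = 0; j < a[0].length; j++)
            for (int i = 0; i < a.length; i++)
                s += a[i][j];
        return s;
    }

    public static void main(String[] args) {
        int n = 4096;
        long[][] a = new long[n][n];
        for (int i = 0; i < n; i++)
            for (int j = 0; j < n; j++)
                a[i][j] = i + j;

        long t0 = System.nanoTime();
        long s1 = sumRowMajor(a);
        long t1 = System.nanoTime();
        long s2 = sumColMajor(a);
        long t2 = System.nanoTime();

        // Same sum either way, but the column-order pass is typically
        // several times slower on commodity hardware.
        System.out.printf("row order: %d ms, column order: %d ms, equal: %b%n",
                (t1 - t0) / 1_000_000, (t2 - t1) / 1_000_000, s1 == s2);
    }
}
```

Both traversals read exactly the same values; only the order, and hence the cache behaviour, differs.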

2.4 Column-Based, Row-Based, and Vertically Partitioned In-Memory Data Layouts

Most current in-memory database systems follow one of two paradigms: column-based or row-based storage. For example, Scuba from Facebook uses a row-based approach, while systems such as Dremel [13] and PowerDrill [7] from Google use a column-based approach. Traditional research on column versus row-based storage has focused on the on-disk setting. SAP HANA allows the system administrator to specify either row-based or columnar storage [5]. The Oracle Database In-Memory system has a dual-format architecture that enables tables to be simultaneously represented in memory in both formats and automatically routes queries to one of the two [14]. In in-memory databases, however, beyond these two extremes, there are new considerations and opportunities given our understanding of the memory hierarchy.


In a nutshell, in a column-based implementation, all data of an attribute, or column as they are alternatively called, are stored together. In contrast, in a row-based implementation, all attributes of one record, or row, are stored together. The two approaches can lead to very different data access patterns for the same query and dataset, and as a result very different cache miss ratios, which ultimately lead to different performance.

Consider a simple query "SELECT * (all attributes) FROM table WHERE Y = something", which is a typical database operation to select and display a single record, a tuple, from the table. For example, the user might want to display all the information about a customer order given a customer ID. Figure 2.3 shows how it would typically be executed under the two approaches. The example on the left represents a row-based format. It takes only one cache miss to retrieve all 8 attributes (attributes a1 to a8 in the first row of the example) of a specific record, assuming only 1 record is selected, because the 8 attributes of the record are adjacent to each other and fit into a single cache line. In the example on the right, representing the columnar format, multiple cache misses are required: the 8 attributes of the requested record are spread far apart in memory, so it takes 8 cache misses to get everything we need, and of all the data loaded into the cache, only 1/8 is actually needed.

On the other hand, for the query "SELECT X FROM table WHERE Y = something", which is typical of analytical workloads, only one attribute is aggregated and another is used as a predicate, with the remaining attributes untouched. In this case, very different behaviour occurs. As shown in Figure 2.4, the column-based format, shown on the right, results in far fewer cache misses than the row-based format: it takes only 1 cache miss to load all the a4 values and another cache miss to load all the a1 values.
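The two examples above can be captured by a back-of-envelope cost model. The sketch below is our own illustration, assuming 8-byte attribute values and a 64-byte cache line (both assumptions, not figures from the thesis); it counts the cache lines touched by a point query (read all attributes of one record) and by a scan query (read one attribute of all records) under each layout.

```java
public class LayoutCostModel {
    static final int LINE = 64;   // assumed cache line size in bytes
    static final int WIDTH = 8;   // assumed attribute width in bytes

    // Cache lines touched to read `attrs` attributes of ONE record.
    // Row store: the attributes are contiguous. Column store: each
    // attribute lives in a different memory region, one line each.
    static int missesPointQuery(int attrs, boolean rowStore) {
        if (rowStore) return (attrs * WIDTH + LINE - 1) / LINE;
        return attrs;
    }

    // Cache lines touched to read ONE attribute of `records` records.
    // Column store: the column is contiguous. Row store: one line per
    // record once a row is at least a cache line wide.
    static int missesScanQuery(int records, int attrsPerRecord, boolean rowStore) {
        if (rowStore) {
            int rowBytes = attrsPerRecord * WIDTH;
            return rowBytes >= LINE ? records
                                    : (records * rowBytes + LINE - 1) / LINE;
        }
        return (records * WIDTH + LINE - 1) / LINE;
    }

    public static void main(String[] args) {
        // "SELECT *" of one 8-attribute record: 1 line (row) vs 8 lines (column).
        System.out.println(missesPointQuery(8, true));   // 1
        System.out.println(missesPointQuery(8, false));  // 8
        // Scanning one attribute of 8 records: 8 lines (row) vs 1 line (column).
        System.out.println(missesScanQuery(8, 8, true));   // 8
        System.out.println(missesScanQuery(8, 8, false));  // 1
    }
}
```

The model reproduces the 1-versus-8 asymmetry of Figures 2.3 and 2.4: each layout wins one query type and loses the other.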

Figure 2.3: Access pattern and cache behaviour comparison for an OLTP-type query [16]

These two simple examples show clearly how different data layouts in memory can lead to very different cache miss ratios and can impact performance significantly. Researchers such as H. Plattner et al. have suggested a hybrid strategy: separating the data into different vertical partitions. A vertical partition is essentially a group of attributes that are stored together as a separate table, within which records are stored in a fashion similar to a row-based store. They have shown that, given that many systems have a mixed OLTP/OLAP workload [11], a hybrid layout leads to the best outcome [16, 17], under the assumption that the workload is known ahead of time.

However, as the number of attributes accessed by a workload containing multiple queries increases, so does the number of possible partitioning layouts, which are essentially the different ways to group a set of attributes together. One major challenge for the hybrid paradigm is therefore to discover the best layout for a given set of queries within an acceptable amount of time.

Figure 2.4: Access pattern and cache behaviour comparison for an OLAP-type query [16]

Previous work on Hyrise suggested that the choice of layout can be based on a comprehensive cache miss model [6]. However, the search time can become very large when the number of attributes is large, which is the case for many modern datasets. For example, as mentioned in Chapter 1, when we ran the partitioning algorithm introduced by Hyrise [6] on a JSON dataset containing over 1000 attributes, many of which are sparse, the program had not finished after several hours and we had to terminate it. Discovering the best vertical partitioning layout within an acceptable period of time is the main focus of this project and of the remainder of this thesis.


2.5 Workload and Data Characteristics

In this section, we describe the workload and data that we target, thereby setting the scope of the current and future work on this project.

Semi-structured data models are used in many different applications. One example of semi-structured data that has gained significant interest in Web 2.0 is the JSON (JavaScript Object Notation) data model. JSON has attracted a lot of interest because it fits naturally with programming languages like JavaScript, Python, Perl, and PHP [3]. It also gives programmers flexibility in that they do not have to provide a schema upfront. Because of these advantages, many web applications and systems are now powered by systems that support JSON stores [3]. In addition, JSON has become a dominant standard for data exchange among web services, with their APIs using JSON as the data interchange format; examples include the APIs of many Twitter, Facebook, and Google services [3]. For these reasons, effective JSON-based stores are in demand, especially for programs that need to communicate with these popular web services.

While NoSQL systems have been popular as JSON document stores in recent years and have their advantages, they lack features, such as rich query capabilities, that are supported by relational database management systems (RDBMSs) [3]. There have therefore been research efforts toward storing JSON data in relational database management systems [3]. Similarly, we chose to focus on the storage of JSON data in relational format.

The JSON data model consists of four primitive types: String, Number, Boolean, and Null, and two structured types: Object and Array [3]. An example of JSON is shown in Figure 2.5. In a nutshell, a JSON object contains multiple key-value pairs, and values can themselves be nested arrays and objects. The JSON model has several characteristics that are important to us when designing a database system.

Figure 2.5: Example of a JSON object

First, many users prefer the JSON data model because of its ease of use: they do not need to provide a schema upfront defining which attributes are in the dataset. A new record can arrive with attributes that have not been encountered in any prior record. When designing a data storage system for JSON data, it is therefore important to maintain this flexibility in dealing with new attributes as they are encountered and as the data structure evolves over time.

Second, as a consequence of the first characteristic, data sparseness can arise, since many attributes occur in only a subset of records. If JSON datasets are stored in relational format, this must be kept in mind: handled inappropriately, it can result in many null values in the table, which, as we will observe in later chapters, can degrade performance.

Third, JSON data can be hierarchical: within a JSON object there can be arbitrarily deep nesting of arrays and objects, as shown in the example in Figure 2.5, including arrays of objects and vice versa.

Fourth, since users do not need to provide a schema upfront, the data type of each attribute is not fixed and enforced throughout the dataset. The value of an attribute in one object can therefore have a different data type than the value of the same attribute in another object. For instance, an attribute called "age" can have the integer value 10 in one object but the string value "ten" in another.
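A store for such data therefore has to derive the type of each value per record rather than per attribute. The following sketch (our own illustration; the names are hypothetical) shows the idea with JSON objects modeled as maps and a type tag derived from each value, mirroring the "age" example above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class DynamicTyping {
    // Derive a storage type tag for a value; the same key may carry a
    // different tag in different objects.
    static String typeTag(Object v) {
        if (v == null) return "Null";
        if (v instanceof Long || v instanceof Integer) return "Long";
        if (v instanceof Boolean) return "Boolean";
        return "String";
    }

    public static void main(String[] args) {
        Map<String, Object> obj1 = new LinkedHashMap<>();
        obj1.put("age", 10L);       // "age" is a number here...
        Map<String, Object> obj2 = new LinkedHashMap<>();
        obj2.put("age", "ten");     // ...and a string here.

        System.out.println(typeTag(obj1.get("age"))); // Long
        System.out.println(typeTag(obj2.get("age"))); // String
    }
}
```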

Many systems, including some of the examples given in previous sections and chapters, were designed with these characteristics in mind. For example, Facebook's Scuba supports sparse data effectively without the need for an upfront schema [1], and Dremel has a very comprehensive and complex data model to deal with nested data efficiently [13].

In terms of the queries in the workload, as mentioned, we are interested in building a system that supports mixed OLTP/OLAP workloads. In their study, J. Krueger et al. investigated several scenarios, found that many enterprise applications have mixed workloads, and argued that many current benchmarks, including the TPC suite, do not cover mixed workloads [11]. However, there is as yet no good standard benchmark for research in this area.

2.5.1 NoBench: A JSON Micro-Benchmark

The NoBench benchmark provides a data generator that produces a series of JSON objects mimicking the characteristics of JSON objects commonly stored in real life. All data are in the form of "key":"value" pairs. Two example JSON objects from the NoBench dataset are shown in Figure 2.6.

Over the entire dataset, there are a total of 1000 sparse attributes and 19 non-sparse attributes. Each JSON object in the dataset includes a number of (usually 10) sparse attributes, which occur in only a small subset of all records in the dataset, and a number of (from 10 to 15) non-sparse attributes, which occur in every object. Each object therefore has a total of 20 to 25 attributes, including unique strings, numbers, dynamically typed attributes, nested arrays, and nested objects.

Sparse attributes, dynamically typed attributes, and nested arrays and objects can all be observed in the two examples. Sparse attributes are those whose keys have the form "sparse_xxx". Dynamically typed attributes have keys starting with "dyn". Nested arrays are enclosed by "[" and "]", while nested objects are enclosed by "{" and "}".
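The structure just described can be sketched as a tiny generator. This is an illustrative sketch of a NoBench-like object, not the actual NoBench generator code: the class and method names are ours, and the choice of a contiguous block of 10 sparse attributes per object is an assumption about the layout of the sparse keys.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Random;

public class NoBenchLikeGenerator {
    static final int TOTAL_SPARSE = 1000;   // universe of sparse attributes
    static final int SPARSE_PER_OBJ = 10;   // sparse attributes per object

    // Produce one object with a few non-sparse attributes plus a block of
    // 10 of the 1000 possible sparse_xxx attributes, chosen per object.
    static Map<String, Object> nextObject(long id, Random rnd) {
        Map<String, Object> obj = new LinkedHashMap<>();
        obj.put("str1", "unique-" + id);         // non-sparse: unique string
        obj.put("num", id);                      // non-sparse: number
        obj.put("bool", id % 2 == 0);            // non-sparse: boolean
        int start = rnd.nextInt(TOTAL_SPARSE / SPARSE_PER_OBJ) * SPARSE_PER_OBJ;
        for (int i = 0; i < SPARSE_PER_OBJ; i++)
            obj.put("sparse_" + (start + i), "GBRDC");
        return obj;
    }

    public static void main(String[] args) {
        Map<String, Object> o = nextObject(42, new Random(0));
        long sparse = o.keySet().stream()
                       .filter(k -> k.startsWith("sparse_")).count();
        System.out.println(o.size() + " attributes, " + sparse + " sparse");
    }
}
```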

Figure 2.6: Examples of JSON objects in the NoBench dataset

Below are some example queries from the NoBench benchmark:

• SELECT str1, num FROM nobench_main;

• SELECT nested_obj.str1, nested_obj.num FROM nobench_main;

• SELECT sparse_XX0, sparse_XX9 FROM nobench_main;

• SELECT * FROM nobench_main WHERE str1 = XXXXX;

• SELECT * FROM nobench_main WHERE num BETWEEN XXXXX AND YYYYY;

• SELECT * FROM nobench_main WHERE XXXXX = ANY nested_arr;

• SELECT * FROM nobench_main WHERE sparse_XXX = YYYYY;

• SELECT COUNT(*) FROM nobench_main WHERE num BETWEEN XXXXX AND YYYYY GROUP BY thousandth;

• SELECT * FROM nobench_main AS left INNER JOIN nobench_main AS right ON (left.nested_obj.str = right.str1) WHERE left.num BETWEEN XXXXX AND YYYYY;

• LOAD DATA LOCAL INFILE file REPLACE INTO TABLE table;

Page 28: A study of an in-memory database system for real-time ... · services, such as Twitter, now provide streaming APIs to provide third parties with a constant ow of data that can be

Chapter 2. Background 22

We used the NoBench dataset and variations of its queries heavily in many of our experiments. In the remainder of this thesis, the NoBench dataset is used as our experimental data unless otherwise specified. The NoBench queries were also varied along different dimensions in our experiments: the number of accessed attributes, the selectivity (for queries with a condition predicate), and the combination of different queries into a multi-query workload.


Chapter 3

Design and Implementation

In this chapter, we first give a high level overview of the different components in our

system and how they interact with each other. We then introduce each component and

its implementation in greater detail.

3.1 Overview

Figure 3.1: System overview



An overview of the implemented system is shown in Figure 3.1. The system currently comprises four key components: 1) an in-memory data container, in which data are stored in table-like data structures in main memory; 2) a data manager, responsible for controlling how data are partitioned and stored in the in-memory data container; 3) a partitioner, which takes workload and data characteristics as input and intelligently generates the optimal partitioning layout describing how the dataset should be vertically partitioned, which the data manager uses to control the data storage layout in main memory; and 4) a query engine, which receives and executes queries one by one by retrieving the required data from the in-memory data container.

As shown on the left of Figure 3.1, the system receives from the outside world JSON data objects to be stored and queries to be executed against the stored data. By profiling the received queries and data over a period of time, the system can extract important characteristics, such as the number of different query types, the attribute access pattern of each query, the relative importance or frequency of the queries in the query mix, the selectivity of each query, the number of attributes in the dataset, and the ratio of sparse attributes, defined as attributes that appear in only a small subset of all data records. All of these characteristics are factors that affect which data layout should be used when storing the data, so that query execution time over a period of time can be improved.

After data and query characteristics have been captured over a period of time, and assuming these characteristics remain stable in the near future, the system can generate a data partitioning layout and store data accordingly. This is where the partitioning algorithm comes into the picture. The partitioner takes the data and workload characteristics as inputs and, based on heuristics, generates a layout intended to give the best performance for the current workload and dataset. As illustrated in the top part of the figure, the partitioner outputs a layout, which is essentially a set of instructions on how to group the attributes of the dataset into different partitions.


Subsequently, the layout generated by the partitioner is used by the in-memory data container to build a separate table in main memory for each partition and populate them with data. The bottom right section of Figure 3.1 illustrates this concept of breaking the dataset into several (3 in the example) smaller tables.

For each query, the query engine retrieves data that are needed from the in-memory

data container and returns appropriate results to the users.

3.2 In-Memory Data Structure

The in-memory data storage component supports four primitive data types: long, boolean, string, and null values. It takes as input from the partitioner a partitioning layout describing how to separate the attributes of the dataset into different partitions. A dataset with N attributes in total can be partitioned into 1 to N partitions. Having 1 partition is equivalent to the traditional row-based storage format, and partitioning into N partitions is the same as the traditional column-based storage format. We call a layout with more than 1 and fewer than N partitions, in which some attributes are grouped together, a vertically partitioned layout; it is essentially a hybrid between the two extremes of column-based and row-based formats.
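Concretely, a layout is just a grouping of attribute names. A minimal sketch of this representation (our own illustration; class and method names are hypothetical) makes the two extremes and a hybrid explicit:

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class Layouts {
    // A vertical-partitioning layout is a list of attribute groups.
    static List<List<String>> rowStore(List<String> attrs) {
        return List.of(attrs);                    // one partition with everything
    }

    static List<List<String>> columnStore(List<String> attrs) {
        return attrs.stream().map(List::of)       // one partition per attribute
                    .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> attrs = Arrays.asList("str1", "num", "bool", "sparse_1");
        System.out.println(rowStore(attrs).size());     // 1 partition = row store
        System.out.println(columnStore(attrs).size());  // 4 partitions = column store
        // A hybrid layout groups attributes that are accessed together:
        List<List<String>> hybrid =
                List.of(List.of("str1", "num"), List.of("bool", "sparse_1"));
        System.out.println(hybrid.size());              // 2 partitions
    }
}
```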

As illustrated in Figure 3.2, in our data structure, each partition includes the following

elements:

• Schema table: a string array that stores metadata, namely attribute names and attribute data types. The name and type of each attribute are combined into the format "Name:Type". For instance, an attribute with name "Age" and type long is represented by the element "Age:Long". The size of this array equals the number of attributes in the partition. When a certain attribute in the table needs to be accessed, the system first scans this schema array to determine whether the attribute exists in this partition and, if so, at which positions its values lie within the main data table.

Figure 3.2: Example of how the system stores vertically partitioned data in main memory

• Main data table: an array of long type that stores the actual data in the case of long, boolean, and null values; for string values it stores references to the memory locations holding the actual strings. This table stores the main content of the partition's data. In this long array, long values are stored as-is. Boolean values are stored as 0 or 1 to represent false and true, respectively. Null values are stored as -11111111, assuming -11111111 does not occur in the dataset. String values are stored in an in-memory byte buffer, and the positions where the strings are located within the byte buffer are stored in the main data table as long values. In addition to the data content, the main data table also stores the object ID before the attributes of each object begin, as shown in Figure 3.2. Object IDs are needed so that objects can be reconstructed later even if attributes belonging to the same object are spread across different partitions; they are also needed for omitting records that are missing all attributes of a partition. It is important to note that the records/rows in this main data table all have a fixed length, corresponding to the length of the schema array, and missing values are stored as null values. The fixed-length arrangement makes it easy to traverse the data table and skip to the right position quickly.

• An in-memory byte buffer that stores the actual string values, with their positions in the byte buffer stored in the main data table.
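The elements above can be sketched in a few dozen lines. This is a simplified illustration under stated assumptions, not the system's actual code: the class name `Partition`, the fixed capacity, and the packing of string offset and length into one long are our own choices; the schema-array format, the fixed-width rows with interleaved object IDs, and the -11111111 null sentinel follow the description above.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;

public class Partition {
    static final long NULL = -11111111L;      // sentinel for missing values

    final String[] schema;                    // e.g. {"Age:Long", "Name:String"}
    final long[] data;                        // [oid, v1, v2, oid, v1, v2, ...]
    final ByteBuffer strings = ByteBuffer.allocate(1 << 16);
    int rows = 0;

    Partition(String[] schema, int capacity) {
        this.schema = schema;
        this.data = new long[capacity * (schema.length + 1)];
    }

    // Fixed-width row: object ID followed by one slot per schema attribute.
    void insert(long oid, Object[] values) {
        int base = rows++ * (schema.length + 1);
        data[base] = oid;
        for (int i = 0; i < schema.length; i++) {
            Object v = values[i];
            if (v == null)                 data[base + 1 + i] = NULL;
            else if (v instanceof Long)    data[base + 1 + i] = (Long) v;
            else if (v instanceof Boolean) data[base + 1 + i] = (Boolean) v ? 1 : 0;
            else {   // String: pack (offset, length) into the byte buffer reference
                byte[] b = v.toString().getBytes(StandardCharsets.UTF_8);
                data[base + 1 + i] = ((long) strings.position() << 32) | b.length;
                strings.put(b);
            }
        }
    }

    // Scan the schema array for an attribute, as described above.
    int columnOf(String name) {
        for (int i = 0; i < schema.length; i++)
            if (schema[i].startsWith(name + ":")) return i;
        return -1;
    }

    long get(int row, int col) {
        return data[row * (schema.length + 1) + 1 + col];
    }

    public static void main(String[] args) {
        Partition p = new Partition(new String[]{"Age:Long", "Married:Boolean"}, 4);
        p.insert(101, new Object[]{10L, true});
        p.insert(102, new Object[]{null, false});   // missing "Age" -> sentinel
        System.out.println(p.get(0, p.columnOf("Age")));    // 10
        System.out.println(p.get(1, 0) == NULL);            // true
    }
}
```

Because every row is schema-width plus one object-ID slot, skipping to row r of column c is pure index arithmetic, which is what makes the fixed-length arrangement fast to traverse.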

3.3 Query Execution

In this section, we describe how typical queries are executed in the engine.

3.3.1 Table Creation

Before any data can be populated and accessed from the engine, a table must first be created. The engine takes a partitioning layout as input, creates the corresponding partitions, and populates the schema table in each partition. Subsequently, when data records are inserted into the table, the values of the different attributes are inserted into the correct partitions based on the layout. When a new attribute is encountered that was not defined during table creation, the system automatically creates a new partition just for that attribute; this avoids the need for the programmer to provide a schema upfront. Over time, if the data characteristics change significantly, for example when a large number of previously nonexistent attributes start to arrive in the database, a re-partitioning process can be triggered so that the partitioning algorithm re-generates a new data layout.


3.3.2 Data Insert

After the table has been created according to the vertical partitioning layout, it needs to be populated with records before other queries can be executed. The engine supports both bulk insertion of multiple records and insertion of individual records.

Each object that enters the system is assigned a system-wide unique object ID before being inserted into the appropriate partitions. A new row, along with the unique object ID, is added to each partition involved in storing the attributes of the incoming record. If none of the attributes of the incoming object appears in the schema array of a particular partition, that partition is not involved, and no new row or data is added to it for this object. The schema tables of the involved partitions are scanned first to ensure that the data are inserted correctly.

As shown in the example in Figure 3.3, hierarchical and nested attributes of JSON objects are flattened and treated as separate attributes in our system. For example, "Kid.Name" represents the attribute "Name" within the nested object "Kid".

Figure 3.3: Example of how the system flattens JSON objects for storage in relational format
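The flattening step can be sketched as a short recursion over nested objects. This is our own simplified illustration, assuming nested JSON objects are represented as maps; it handles nested objects only, whereas the system also deals with arrays.

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class Flattener {
    // Recursively flatten nested objects into dotted attribute names,
    // e.g. {"Kid": {"Name": "Bob"}} becomes {"Kid.Name": "Bob"}.
    static Map<String, Object> flatten(Map<String, Object> obj) {
        Map<String, Object> out = new LinkedHashMap<>();
        flattenInto("", obj, out);
        return out;
    }

    @SuppressWarnings("unchecked")
    private static void flattenInto(String prefix, Map<String, Object> obj,
                                    Map<String, Object> out) {
        for (Map.Entry<String, Object> e : obj.entrySet()) {
            String key = prefix.isEmpty() ? e.getKey() : prefix + "." + e.getKey();
            if (e.getValue() instanceof Map)
                flattenInto(key, (Map<String, Object>) e.getValue(), out);
            else
                out.put(key, e.getValue());
        }
    }

    public static void main(String[] args) {
        Map<String, Object> kid = new LinkedHashMap<>();
        kid.put("Name", "Bob");
        Map<String, Object> person = new LinkedHashMap<>();
        person.put("Name", "Alice");
        person.put("Kid", kid);
        System.out.println(flatten(person)); // {Name=Alice, Kid.Name=Bob}
    }
}
```

After flattening, every attribute, however deeply nested in the original object, is an ordinary dotted-name column that can be assigned to a partition.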


Data sparseness arises when some attributes exist in only a small subset of objects. In our implementation, if an attribute is missing from a record, a null value, represented by -11111111, is stored instead to indicate the missing value. This can also be observed in the example in Figure 3.3: the attribute "Kid" does not exist in the object on the left, so null values are inserted into the first row of the table for the attributes within "Kid". The total number of null values depends on the partitioning layout and can have a significant impact on the performance of the engine; this aspect is investigated further in Section 4.3.

3.3.3 Select

Examples of selection queries that the system supports are shown below:

1. SELECT A,B,C FROM table

2. SELECT * FROM table

3. SELECT A,B,C FROM table WHERE D = ”Some string”

4. SELECT A,B,C FROM table WHERE E BETWEEN ”value1” AND ”value2”

5. SELECT A,B,C FROM table WHERE ”Some string” = ANY F

6. SELECT COUNT(*) FROM table WHERE E BETWEEN "value1" AND "value2" GROUP BY G

Queries 1 to 5 are common selection queries that retrieve some stored data. Query 6 aggregates based on a condition and returns the aggregated counts.

Queries 1 and 2 are similar in that they both simply select some attributes from the table, without any condition evaluation; all rows are therefore accessed to select the attributes required in the select clause. The symbol * in query 2 denotes "all attributes".


Queries 3, 4, and 5 are similar in that they all select some attributes from the table for records that satisfy the conditions specified in the where clause. Query 3 checks whether attribute D matches a specified string. Query 4 checks whether attribute E has a numerical value between two values, which can be used to adjust the selectivity of the query during experiments. In query 5, F is a nested array with multiple values, and the query checks whether any of the F[x] values of a record matches a specified string.

For queries 1 and 2, the engine first checks the schema tables of all partitions to identify the partitions that contain attributes requested by the query. The engine then goes through the involved partitions one by one, selecting the required attributes by reading from the main data table of each partition. As mentioned before, the main data table of each partition is essentially a one-dimensional array; to read only a subset of attributes from it, only the array elements corresponding to those attributes, whose positions can be inferred from the schema table, are read. After a value is read, it is immediately placed into an answer object, which contains in-memory arrays storing the selected values, keys, and object IDs, and which is returned to the user at the end of query execution.

For queries such as examples 3, 4, and 5, the partition containing the attribute in the where clause is scanned first. For each record, the where-clause attribute is read and evaluated, and a list of the object IDs of the records that satisfy the condition is constructed along the way.

If all or a subset of the attributes in the select clause reside in the same partition as the where-clause attribute, those attributes are selected immediately after the condition is read and evaluated; this process repeats until the end of the partition is reached. If none of the select-clause attributes is in the same partition as the where-clause attribute, the engine simply reads and evaluates the where attribute of each record in turn until the end is reached.


The remaining partitions that contain any select-clause attributes are then scanned one by one to gather the remaining data, based on the object ID list built while scanning the partition with the where attribute. When traversing such a partition, the engine reads only records whose object IDs appear in that list.

This arrangement improves performance by avoiding back-and-forth jumps between partitions, and hence between distant memory locations, whenever a record is selected. Figure 3.4 shows a simple example. In this example, the query "SELECT C2,C3,C4 WHERE C1=something" was executed on a layout with C1 and C2 in one partition and C3 and C4 in another. Assuming that the C1 values in rows 1 and 3 satisfy the condition, the system skips the values associated with row 2 and inserts the object IDs of rows 1 and 3 into the list, so that C3 and C4, which are in the other partition, can be selected after the system finishes scanning the first partition.
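The two-phase scan described above can be sketched as follows. This is a simplified illustration, not the thesis code: the hard-coded two-partition layout, the column values, and the class name are hypothetical, and each partition is modeled as a row-major array of its columns.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative two-phase partitioned scan for
// "SELECT C2, C3, C4 WHERE C1 < threshold" on a layout with {C1, C2} in one
// partition and {C3, C4} in another. Each partition is a row-major array.
class PartitionedSelect {
    static final long[][] PARTITION_1 = { {5, 10}, {80, 20}, {7, 30} };            // C1, C2
    static final long[][] PARTITION_2 = { {100, 1000}, {200, 2000}, {300, 3000} }; // C3, C4

    public static List<long[]> select(long threshold) {
        List<Integer> matchingIds = new ArrayList<>();
        List<long[]> answer = new ArrayList<>();
        // Phase 1: scan the partition holding the WHERE attribute end to end;
        // C2 is co-located with C1, so it is picked up in the same pass.
        for (int row = 0; row < PARTITION_1.length; row++) {
            if (PARTITION_1[row][0] < threshold) {
                matchingIds.add(row);
                answer.add(new long[] { PARTITION_1[row][1], 0, 0 }); // C2; C3, C4 later
            }
        }
        // Phase 2: scan the remaining partition once, reading only the rows
        // whose object IDs were collected in phase 1.
        for (int i = 0; i < matchingIds.size(); i++) {
            int row = matchingIds.get(i);
            answer.get(i)[1] = PARTITION_2[row][0]; // C3
            answer.get(i)[2] = PARTITION_2[row][1]; // C4
        }
        return answer;
    }
}
```

Scanning each partition fully before moving to the next is what preserves the sequential access pattern that the partitioning is designed to exploit.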

Post-Select Record Ordering

As mentioned above, when data are selected from the data table during the execution of a query, they are placed into a temporary in-memory answer object, which is returned to the user at the end.

One problem that arises when data are stored in a vertically partitioned layout or in a column-based format is the order of the selected data in the answer arrays. In most applications, when a subset of attributes is requested, it is desirable to return the results in a format where the selected values belonging to the same object are stitched together into tuples.

If all requested attributes are stored in the same partition, no problem would occur

since data belonging to the same object are essentially adjacent to each other and are

scanned and inserted into the answer object in the correct order. However, if the requested attributes are fragmented across multiple partitions, then, to leverage locality to the fullest, each of the involved partitions is scanned fully before moving on to the next one. Data would therefore be scanned and inserted into the answer object in fragmented tuples, which is not desirable. Figure 3.5 illustrates this concept.

Figure 3.4: Example of how the query "SELECT C2,C3,C4 WHERE C1=something" is executed using our implementation

In the latter case, where the selected data arrive at the answer object in an undesirable order, an additional step is necessary to sort the values and stitch them back into object-based tuples before returning them to the user. In a nutshell, this ordering step constructs a series of linked lists, one per selected object, as the data arrive in the answer object. Each linked list holds position references indicating where the data belonging to that record can be found in the pre-ordered answer object. These linked lists are then used to construct a new answer object by copying data from the pre-ordered answer object into it, record by record.

Figure 3.5: Example of post-select record ordering

This data ordering step adds to query execution time under certain partitioning layouts; its effects are shown in some of our analyses later on.
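A simplified sketch of this ordering step follows; the class name, the flat-array representation of the pre-ordered answer object, and the per-value object-ID array are illustrative assumptions, not the thesis implementation.

```java
import java.util.ArrayList;
import java.util.LinkedList;
import java.util.List;

// Simplified sketch of post-select record ordering. Values arrive partition by
// partition, so each object's values sit at scattered positions of the flat
// pre-ordered answer array; one linked list per selected object records those
// positions, and a final pass copies the values out record by record.
class PostSelectOrdering {
    public static long[][] reorder(long[] preOrdered, int[] objectIdOfValue,
                                   int[] selectedObjects, int attrsPerObject) {
        // Build one position list per selected object, in arrival order.
        List<LinkedList<Integer>> positions = new ArrayList<>();
        for (int i = 0; i < selectedObjects.length; i++) {
            positions.add(new LinkedList<>());
        }
        for (int pos = 0; pos < preOrdered.length; pos++) {
            for (int i = 0; i < selectedObjects.length; i++) {
                if (objectIdOfValue[pos] == selectedObjects[i]) {
                    positions.get(i).add(pos);
                }
            }
        }
        // Stitch the fragments back into object-based tuples.
        long[][] answer = new long[selectedObjects.length][attrsPerObject];
        for (int i = 0; i < selectedObjects.length; i++) {
            int j = 0;
            for (int pos : positions.get(i)) {
                answer[i][j++] = preOrdered[pos];
            }
        }
        return answer;
    }
}
```

The final copy is sequential over each record's position list, which is what makes the cost of this step proportional to the amount of selected data.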

3.3.4 Inner Join

Our implementation also supports inner join between two tables, for example: SELECT

left.A,right.B FROM table1 AS left INNER JOIN table2 AS right ON (left.C =

right.D) WHERE left.E BETWEEN ”value1” AND ”value2”.

Queries of this type typically include an evaluation clause, a selection clause, and a join predicate clause, which determines how entries from the two tables should be joined based on the equijoin predicate fields. We implemented a hash join to support the execution of such queries. In the "build" phase, the engine scans the first table to build a hash table with the predicate attribute as key and the row ID as value, evaluating the condition in the where clause at the same time to avoid inserting unnecessary entries into the hash table. In the "probe" phase, the engine scans the second table, finds relevant rows by looking them up in the hash table, and builds a list of object IDs for the selected records from both tables. Attributes in the selection clause are then selected from the tables using the object ID lists, following a procedure similar to the one described in the previous section.
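The build/probe flow can be sketched roughly as follows; the class name, the array-based table representation, and the restriction to unique build-side keys are simplifying assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Illustrative hash join for a query of the form
// "SELECT ... INNER JOIN ... ON (left.key = right.key) WHERE left.w BETWEEN lo AND hi".
// The build phase filters with the WHERE clause while hashing; the probe phase
// emits matching (leftRow, rightRow) ID pairs. Unique build-side keys assumed.
class HashJoinSketch {
    public static List<int[]> join(long[] leftJoinKey, long[] leftWhere,
                                   long lo, long hi, long[] rightJoinKey) {
        // Build: hash join key -> row ID, skipping rows that fail the filter
        // so that unnecessary entries never enter the hash table.
        Map<Long, Integer> hashTable = new HashMap<>();
        for (int row = 0; row < leftJoinKey.length; row++) {
            if (leftWhere[row] >= lo && leftWhere[row] <= hi) {
                hashTable.put(leftJoinKey[row], row);
            }
        }
        // Probe: look up each right-table key; collect matched row-ID pairs,
        // from which the select attributes of both tables are later gathered.
        List<int[]> matches = new ArrayList<>();
        for (int row = 0; row < rightJoinKey.length; row++) {
            Integer leftRow = hashTable.get(rightJoinKey[row]);
            if (leftRow != null) {
                matches.add(new int[] { leftRow, row });
            }
        }
        return matches;
    }
}
```

Filtering during the build keeps the hash table small, which matters because the table is probed once per row of the second table.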

3.4 ARGO: Alternative Data Structure for JSON in Relational Format Used for Comparison

3.4.1 ARGO Concepts

Argo was introduced as a data structure that maps JSON objects into relational format

so that they can be stored in traditional relational databases [3]. It has two methods:

Argo1 and Argo3.

The concept of the Argo1 data structure is visualized in Figure 3.6. In Argo1, all data in the dataset are stored in a single 5-column table, and each attribute and its value are stored in a separate row. The first column stores the object ID of the object that the attribute belongs to. The second column stores the key of the attribute. Depending on the data type of the attribute, its value is stored in the third, fourth, or fifth column, with nulls in the remaining two columns. For example, if the value of the attribute is a long number, the value is stored in the fourth column, and the third and fifth columns hold nulls.

In Argo3, data are separated into three tables, one for each of the data types that it supports: long, string, and boolean, as also shown in Figure 3.6. Each table has three columns. As in Argo1, the first two columns store the object IDs and attribute keys. The last column stores the actual value of the attribute. Depending on the data type, an attribute and its value are stored in one of the three tables.

Figure 3.6: Transformation of JSON objects into relational format using Argo1 and Argo3
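The Argo1 mapping can be sketched as follows. This is an illustrative reconstruction from the description above, not the Argo authors' code; the class name is hypothetical, and nested objects and arrays, which Argo handles by key flattening, are omitted.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative Argo1 row: one (objectId, key) pair per row, with the value in
// exactly one of the three typed columns and nulls in the other two. Argo3
// would instead route each row into one of three per-type tables.
class Argo1Row {
    final int objId;       // column 1: object ID
    final String key;      // column 2: attribute key
    final String strVal;   // column 3: string value or null
    final Long longVal;    // column 4: long value or null
    final Boolean boolVal; // column 5: boolean value or null

    Argo1Row(int objId, String key, Object value) {
        this.objId = objId;
        this.key = key;
        this.strVal = (value instanceof String) ? (String) value : null;
        this.longVal = (value instanceof Long) ? (Long) value : null;
        this.boolVal = (value instanceof Boolean) ? (Boolean) value : null;
    }

    // Decompose one flat JSON-like object into Argo1 rows.
    static List<Argo1Row> decompose(int objId, Map<String, Object> obj) {
        List<Argo1Row> rows = new ArrayList<>();
        for (Map.Entry<String, Object> e : obj.entrySet()) {
            rows.add(new Argo1Row(objId, e.getKey(), e.getValue()));
        }
        return rows;
    }
}
```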

Both the Argo1 and Argo3 data structures have the advantage of being flexible for JSON data and easy to use, as users do not need to provide a schema upfront. By design, the table representation takes care of issues such as sparse attributes, hierarchical data, and dynamic typing. However, we wanted to focus on query execution speed and compare Argo's performance in an in-memory setting with our hybrid partitioning data structure, using the very same NoBench dataset and queries. We implemented both Argo1 and Argo3 to the best of our ability in the same language that we used for our system, Java, and with the same practice of storing string values separately in a bytebuffer, keeping only references to the bytebuffer in the main data table.


3.4.2 Baseline Comparison with ARGO

In this section, we compare Argo1 and Argo3 against our vertical partitioning data structure, as well as against the traditional columnar and row-based approaches, measuring query execution performance on a range of query types provided by the NoBench micro-benchmark.

The advantage of Argo is its flexibility and ease of use. However, it has no way to optimize the layout of the data to improve performance: users only have the option of choosing either Argo1 or Argo3. Moreover, the distance between occurrences of the same attribute in different objects varies significantly, because attributes can arrive in any order and sparse attributes exist. This makes it more difficult and less efficient to access only a small subset of attributes, because the engine does not know where to skip to in order to reach a particular attribute.

With these concepts in mind, we performed a quick experiment running 9 different selection queries with different selectivities and different numbers of attributes in the select clause. The entire dataset has 200,000 objects. Each query was executed 10 times; the largest two values of the 10 runs were excluded, and the average of the remaining 8 runs is reported.

For the vertical partitioning method, a partitioning layout is needed; choosing one is the focus of subsequent chapters. For this experiment, we used a layout that partitioned the sparse attributes into small chunks of 10 and placed all non-sparse attributes in the same partition. As we will see in Chapter 5, this layout is actually not the best, and the partitioning algorithm usually produces better layouts. We used it only as a conservative comparison against the Argo formats.

Note that in this evaluation and the subsequent experiments in the rest of this thesis, we only measured the time a query takes to read the data it needs and put them into an answer object in memory. The measured time does not include materialization of the selected data. Moreover, in the case of string values, we only measure the time it takes to read the reference from the main data table, without reconstructing the actual string from the bytebuffer.

Figure 3.7: Baseline comparison among different data structure formats for JSON data

Figure 3.7 shows the results of this experiment. The vertical partitioning format performed better than Argo and the two traditional formats in all cases we tested except one: the last query, which selects all attributes with a very high selectivity of 100%. In this case, the Argo formats slightly outperformed our layout. The reason is that for this type of query, all data are needed and are densely packed, so both formats enjoy good cache locality, and Argo1 does not need to pay the price of post-select record ordering. However, we expect this type of query to be extremely rare in practice.

Another interesting observation is that the column-based format performed well when there was a small number of attributes in the select clause and poorly otherwise; the reverse was true for the row-based format. The vertically partitioned layout, in contrast, was able to match the better of these two traditional formats at both extremes. This will be investigated further in the next chapter.


3.5 Partitioner

In this section, we briefly introduce the partitioning algorithm. As mentioned in Chapter 1, this algorithm is not a direct contribution of this thesis, as it was developed by a collaborator; this section only introduces the algorithm she developed.

The algorithm starts with an initial partitioning. In each iteration, it goes through all attributes and partitions, and for each attribute-partition pair it calculates the gain of migrating the attribute to that partition. At the end of the iteration, it chooses the migration with maximum gain and moves that attribute to its new partition. The gain of a migration is the difference between the cost function values before and after the migration. The algorithm terminates when the maximum gain is zero.
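The greedy loop can be sketched as follows, with the cost function left pluggable (Equations 3.1-3.8 in the next subsection define the real one). The layout representation as lists of attribute IDs and the class name are illustrative, not the collaborator's code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.ToDoubleFunction;

// Sketch of the greedy migration loop: per iteration, evaluate the gain (cost
// before minus cost after) of moving every attribute into every other
// partition, apply the single best move, stop when no move has positive gain.
class GreedyPartitioner {
    public static List<List<Integer>> partition(List<List<Integer>> initial,
            ToDoubleFunction<List<List<Integer>>> cost) {
        List<List<Integer>> layout = deepCopy(initial);
        while (true) {
            double baseCost = cost.applyAsDouble(layout);
            double bestGain = 0;
            List<List<Integer>> bestLayout = null;
            for (int from = 0; from < layout.size(); from++) {
                for (int attr : layout.get(from)) {
                    for (int to = 0; to < layout.size(); to++) {
                        if (to == from) continue;
                        List<List<Integer>> candidate = deepCopy(layout);
                        candidate.get(from).remove(Integer.valueOf(attr));
                        candidate.get(to).add(attr);
                        double gain = baseCost - cost.applyAsDouble(candidate);
                        if (gain > bestGain) {
                            bestGain = gain;
                            bestLayout = candidate;
                        }
                    }
                }
            }
            if (bestLayout == null) return layout; // maximum gain is zero: done
            layout = bestLayout;
        }
    }

    static List<List<Integer>> deepCopy(List<List<Integer>> layout) {
        List<List<Integer>> copy = new ArrayList<>();
        for (List<Integer> p : layout) copy.add(new ArrayList<>(p));
        return copy;
    }
}
```

Like any greedy hill-climbing scheme, this converges to a local minimum of the cost function, which is why the choice of starting layout (next subsection) matters.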

3.5.1 Initial Partitioning

In cases where starting from the current layout is not possible or beneficial (e.g., after a significant change in workload characteristics, or when generating the layout for the first time), the algorithm generates an initial layout. Two obvious candidates are the row-based and column-based layouts. However, due to the large number of attributes in the benchmark dataset, having only one table or more than 1000 tables is far from ideal. Starting from one of those extremes is therefore not reasonable, as the algorithm would converge either too slowly or to a local minimum near the extreme starting point. Instead, we use the following heuristic to generate the initial layout.

First, queries are sorted in descending order of their weights. For each query, all of

its attributes that are not already assigned to any partitions are grouped into one new

partition.
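A sketch of this heuristic, assuming each query is given as a list of attribute names plus a weight (the representation and class name are illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the initial-layout heuristic: visit queries in descending weight
// order and group each query's not-yet-assigned attributes into a new partition.
class InitialLayout {
    public static List<List<String>> build(List<List<String>> queryAttrs,
                                           double[] weights) {
        // Order query indices by descending weight.
        Integer[] order = new Integer[weights.length];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, (x, y) -> Double.compare(weights[y], weights[x]));

        List<List<String>> partitions = new ArrayList<>();
        Set<String> assigned = new HashSet<>();
        for (int q : order) {
            List<String> newPartition = new ArrayList<>();
            for (String attr : queryAttrs.get(q)) {
                if (assigned.add(attr)) newPartition.add(attr); // not yet claimed
            }
            if (!newPartition.isEmpty()) partitions.add(newPartition);
        }
        return partitions;
    }
}
```

Heavier queries claim their attribute sets first, so the partitions they scan start out free of attributes they never touch.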


3.5.2 Cost Model

For each query, attributes that are in the accessed partitions but not actually accessed by the query prevent its execution from exploiting the best cache locality, since some attributes are brought into the cache and occupy cache space without being needed, as we will discuss in Section 4.1. The following formulas represent the heuristic cost of redundant-attribute overhead for all queries over all partitions, where sel(q, a) is the selectivity of query q for attribute a, and sel(q, p) is defined as the maximum selectivity of query q over all attributes of partition p; spa(a) is the sparseness ratio of attribute a, and spa(p) is similarly the maximum sparseness ratio over all attributes of partition p. In Equation 3.4, RAC stands for Redundant Access Cost, Q and P are the query and partition sets respectively, and hasAttr(p, q) is a boolean function that is true if and only if there exists an attribute in partition p that is accessed by query q. Finally, w(q) is the weight of query q.

Equation 3.4 indicates that, for each query, attributes that belong to accessed partitions but are either not accessed by the query at all, or accessed at a lower rate, increase RAC; the increment depends on the differences in both access rate and sparseness ratio between attributes and their underlying partitions. Larger differences cause more non-accessed or null values in partitions, and in turn poorer cache locality. Furthermore, w(q) is used as a coefficient in the equation because different queries have different execution costs, based on their relative frequencies in the overall workload as well as their actual execution times.

$$
sel(q, a) =
\begin{cases}
1 & a \in \text{condition part of } q \\
sel(q) & a \in \text{selection part of } q \\
0 & a \notin q
\end{cases}
\tag{3.1}
$$

$$ sel(q, p) = \max \{\, sel(q, a) \mid a \in p \,\} \tag{3.2} $$


$$ spa(p) = \max \{\, spa(a) \mid a \in p \,\} \tag{3.3} $$

$$
RAC = \sum_{q \in Q} \sum_{p \in P} hasAttr(p, q) \times w(q) \times \sum_{a \in p} \bigl( spa(p) \times sel(q, p) - spa(a) \times sel(q, a) \bigr)
\tag{3.4}
$$
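Equation 3.4 can be evaluated directly over small dense arrays, for example. This is an illustrative sketch; the array-based encoding of sel, spa, and the layout is an assumption of the example, not the collaborator's code.

```java
// Direct evaluation of the Redundant Access Cost (Equation 3.4):
// sel[q][a] and spa[a] follow Equations 3.1-3.3, access[q][a] marks the
// attributes query q touches, part[a] gives each attribute's partition,
// and w[q] is the query weight.
class RedundantAccessCost {
    public static double rac(double[][] sel, double[] spa, boolean[][] access,
                             int[] part, double[] w, int numPartitions) {
        double total = 0;
        for (int q = 0; q < sel.length; q++) {
            for (int p = 0; p < numPartitions; p++) {
                // hasAttr(p, q), sel(q, p) (Eq. 3.2) and spa(p) (Eq. 3.3).
                boolean hasAttr = false;
                double selQP = 0, spaP = 0;
                for (int a = 0; a < spa.length; a++) {
                    if (part[a] != p) continue;
                    if (access[q][a]) hasAttr = true;
                    selQP = Math.max(selQP, sel[q][a]);
                    spaP = Math.max(spaP, spa[a]);
                }
                if (!hasAttr) continue; // partition never scanned by q
                for (int a = 0; a < spa.length; a++) {
                    if (part[a] != p) continue;
                    total += w[q] * (spaP * selQP - spa[a] * sel[q][a]);
                }
            }
        }
        return total;
    }
}
```

Each attribute contributes exactly its shortfall relative to the partition maxima, so a partition whose attributes are all accessed at the same rate and sparseness adds nothing.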

Putting each attribute in its own partition eliminates the RAC. However, for queries that need wide access to many attributes, the overhead of accessing several tables can be very high due to post-select record ordering, as observed in the case studies. Thus, a second heuristic formula is needed to capture the cost of accessing different tables during a single query execution.

To represent this cost, the problem is mapped to a graph partitioning problem in which attributes are graph vertices and the weight of the edge between an attribute pair reflects the pair's affinity for being used in the same query. In this mapping, the cost of queries accessing different partitions is equivalent to the sum of the weights of all edges that cross between those partitions. To complete the mapping, it is important to assign proper weights to each edge based on workload information (e.g., query access patterns and attribute sparseness ratios). Equation 3.6 shows the formula for the weight w(a, b) of the edge between two arbitrary attributes a and b, where Qab is the set of queries that access both a and b together. Note that w(a, b) is the penalty a layout pays if it maps a and b to different partitions. Thus, the cost is higher when the two attributes have similar sparseness ratios and are accessed by many queries at similar access rates. Conversely, w(a, b) between a sparse attribute and a common attribute is small, because keeping them in the same partition would introduce null values. Equation 3.7 shows the total Cross Partition Cost (CPC) for the whole workload, which is the sum of w(a, b) over all attribute pairs that belong to different partitions.


$$ Q_{ab} = \{\, q \in Q \mid a \in q \wedge b \in q \,\} \tag{3.5} $$

$$
w(a, b) = \frac{\min(spa(a), spa(b))}{\max(spa(a), spa(b))} \times \sum_{q \in Q_{ab}} w(q) \times \frac{\min(sel(q, a), sel(q, b))}{\max(sel(q, a), sel(q, b))}
\tag{3.6}
$$

$$
CPC = \sum_{a, b \in A} \{\, w(a, b) \mid p_a \neq p_b \,\}
\tag{3.7}
$$
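Equations 3.6 and 3.7 can likewise be sketched in code. Membership of a query in Qab is approximated here by a nonzero sel value, a simplifying assumption of this example; the class name and array encoding are illustrative.

```java
// Sketch of the edge weight w(a, b) (Equation 3.6) and the Cross Partition
// Cost (Equation 3.7). sel[q][a], spa[a], and w[q] are as in Equations
// 3.1-3.4; part[a] maps each attribute to its partition.
class CrossPartitionCost {
    public static double edgeWeight(int a, int b, double[][] sel,
                                    double[] spa, double[] w) {
        double sum = 0;
        for (int q = 0; q < sel.length; q++) {
            if (sel[q][a] == 0 || sel[q][b] == 0) continue; // q not in Q_ab
            sum += w[q] * Math.min(sel[q][a], sel[q][b])
                        / Math.max(sel[q][a], sel[q][b]);
        }
        return Math.min(spa[a], spa[b]) / Math.max(spa[a], spa[b]) * sum;
    }

    // CPC: edge weights summed over attribute pairs split across partitions.
    public static double cpc(double[][] sel, double[] spa, double[] w, int[] part) {
        double total = 0;
        for (int a = 0; a < spa.length; a++) {
            for (int b = a + 1; b < spa.length; b++) {
                if (part[a] != part[b]) total += edgeWeight(a, b, sel, spa, w);
            }
        }
        return total;
    }
}
```

The min/max ratios drive the weight toward zero whenever the two attributes differ sharply in sparseness or access rate, which is exactly when separating them is cheap.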

In this subsection, RAC and CPC were introduced to capture two opposing overheads: carrying redundant attributes in accessed partitions versus accessing several partitions during query execution. Since the overall cost function for evaluating a particular layout must consider both criteria at the same time, the Layout Cost (LC) is the weighted average of the normalized values of CPC and RAC (Equation 3.8). Normalization is necessary to map the absolute values to a 0-1 scale so that the sum is meaningful, and alpha is a workload-dependent parameter representing the relative importance of CPC and RAC for the specific workload. Note that CPC and RAC reach their maximum possible values in the column-based and row-based layouts, respectively.

$$
LC = \alpha \times \frac{CPC}{CPC_{max}} + (1 - \alpha) \times \frac{RAC}{RAC_{max}} \qquad (0 \leq \alpha \leq 1)
\tag{3.8}
$$
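Equation 3.8 is then a straightforward normalization and weighted average, e.g. (illustrative class name):

```java
// Layout Cost (Equation 3.8): a weighted average of CPC and RAC, each
// normalized by its maximum possible value (reached by the pure column-based
// and row-based layouts, respectively).
class LayoutCost {
    public static double lc(double cpc, double cpcMax,
                            double rac, double racMax, double alpha) {
        if (alpha < 0 || alpha > 1) {
            throw new IllegalArgumentException("alpha must be in [0, 1]");
        }
        return alpha * (cpc / cpcMax) + (1 - alpha) * (rac / racMax);
    }
}
```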

3.5.3 Parameter Alpha: Fine-Grain vs. Coarse-Grain Partitioning

One important implication of Equation 3.8 is that the parameter alpha determines the granularity of the resulting partitioning layout. At a high level, a higher alpha value causes the partitioning algorithm to produce a more fine-grained layout with narrower tables, hence fewer attributes per partition. In contrast, a lower alpha produces larger partitions.


In Chapter 5, we show that the performance of our system is sensitive to the alpha value used in the partitioning algorithm: in different situations, different alpha values lead to different performance results.

Let us consider an example: attributes A, B and attributes C, D are accessed by two different queries, without conflict. A relatively fine-grained layout would tend to place A and B in one partition and C and D in another; a coarser-grained layout would place all 4 attributes in one bigger partition. The first layout is better in this case, due to potentially better cache locality for both queries. The second layout has poorer cache locality, since whichever query is executed, untouched attributes are mixed with accessed attributes in each cache access. Still, the second layout is not bad: at least no data ordering is needed, and it is much better than a layout that places A and C in one partition and B and D in another.

When the conflict level is higher, e.g., when query 1 needs to access attributes A and B and query 2 needs to access B and C, a fine-grained layout has to choose where to put B. Putting B with A favours query 1 and causes post-select record ordering for query 2, and vice versa. This is where a less segmented layout is beneficial: although neither query enjoys the best cache locality, neither suffers from missing attributes in its partition, which would cause expensive post-select ordering.

This concept is further studied and confirmed in Chapter 5.

Chapter 4

Vertical Partitioning Characterization

With our system in place, in this chapter we investigate the impact of different partitioning practices by conducting several case studies. These case studies helped us develop a better understanding of vertical partitioning and distill some guidelines for it. These guidelines were then incorporated into the development of the partitioning algorithm introduced in Chapter 3.

4.1 Case Study 1: Co-Location of Attributes Accessed Together

In this case study, we investigate the effect of putting attributes that are accessed together

into the same partition. The idea can be demonstrated by the example shown in Figure

4.1. Consider a query example where 4 columns, all of which store long numbers, are

selected together when some selective conditions are met. Let us also consider two

partitioning layouts. The first layout has all 4 of these attributes in one partition, and

the second layout has 8 attributes in one partition, including the 4 accessed attributes


and 4 other attributes that are not accessed by this query. Although the amount of

data that is selected is the same for both layouts, the performance would be different

due to the cache locality effect. Specifically, we expect the first layout to have better

performance, especially when selectivity is high, because when a particular row is loaded

into the cache, the subsequent row would also be in the cache, therefore avoiding the need

to read it from memory later on. Moreover, having many non-accessed attributes in the same partition as those that are actually needed spreads the needed data over more memory pages. More memory pages must therefore be loaded to select all of the requested data, which is expensive due to potential TLB (translation lookaside buffer) misses and the resulting cost per access.

Figure 4.1: Differences in cache behaviours between layouts with and without unwanted attributes in the same partition as wanted attributes

4.1.1 Methodology

In this case study, we generated our own dataset with 200,000 records. Each record

contains 33 attributes with randomly generated values. The reason for not using the

NoBench dataset for this particular case study is that it contains many sparse attributes,

and we wanted to isolate the effect of data sparseness from this case study and examine


it later in Section 4.3.

Two groups of experiments were performed:

Experiment 1 investigates the effect of having unwanted attributes mixed in the same

partition as ones that are actually accessed by the query. We performed the experiment

with the query ”SELECT C1, C2, C3, C4 WHERE C33 BETWEEN x AND y”, where

x and y were values adjusted to control the selectivity of the query. In this experiment, the attribute in the where clause, C33, is kept in a separate partition. Whether attributes in the where clause should be put together with select attributes is investigated in Section 4.2.

We tested the queries on column-based, row-based, and a series of vertically partitioned layouts with an increasing number of non-accessed attributes mixed into the same partition as the 4 accessed attributes. We expected the layout that has only the four accessed attributes in one partition, with no other attributes, to perform better than all other formats.

In experiment 2, we investigated the effect of breaking accessed attributes into multiple partitions. We considered a query example where 32 attributes, all of which store long numbers, are selected together when some selective conditions are met. Specifically, we executed the query "SELECT C1, C2, C3, C4, C5, C6, C7, C8, C9, C10, C11, C12,

C13, C14, C15, C16, C17, C18, C19, C20, C21, C22, C23, C24, C25, C26, C27, C28,

C29, C30, C31, C32 WHERE C33 BETWEEN x and y”.

The queries were tested on column-based and on various vertically partitioned layouts with the 32 attributes segmented into different numbers of partitions. We expected having all 32 attributes in one partition to yield the best cache locality, since whenever a row is selected all of its attributes are close by, leading to better performance. If we break these attributes into many different partitions, then the attributes associated with the same record are stored in different

locations in the main memory, leading to poorer performance as a result of more cache misses, since data need to be fetched from multiple distant locations. Moreover, as demonstrated by the example in Figure 3.5, breaking co-selected attributes into multiple partitions also leads to post-select data re-ordering, since data from the same record arrive in the answer array at different times and need to be ordered back into object-oriented tuples.

Figure 4.2: Performance comparison between different layouts when executing the query "SELECT C1, C2, C3, C4 WHERE C33 BETWEEN x AND y"

In this case study, each query was run 10 times for each condition, and the execution times were averaged after filtering out the highest 2 values.

4.1.2 Results

Figure 4.2 shows the performance of different layouts when executing the queries in

experiment 1.

First, when the selectivity was very low, i.e., 1%, the performance of the different partitioning layouts did not vary much. When the selectivity is very low, the chance that data loaded into the cache will be needed next is also very low, so the benefit of co-locating data that are accessed together is not noticeable relative to the short run time.


Figure 4.3: Performance comparison between different layouts when executing the query "SELECT C1 - C32 WHERE C33 BETWEEN x and y"

Figure 4.4: Select and post-select record ordering time comparison between different layouts when executing the query "SELECT C1 - C32 WHERE C33 BETWEEN 0 and 5000" (50% selectivity)


In contrast, when the selectivity was higher, e.g., over 25%, the chance of loading soon-to-be-needed data into the cache alongside the current read was also higher. Query execution therefore benefited greatly from having the co-accessed data in the same partition, free of fields that were not accessed, as observed in Figure 4.2, thanks to better cache locality on the requested data.

Figure 4.2 also shows the LLC (last-level cache) misses during the data selection phase at 50% selectivity. Having all 4 selected attributes in one partition, with no other attributes, led to the fewest LLC misses, and the miss rate increased as more unnecessary attributes were included in the partition. This suggests that the differences in execution time across layouts were driven by differences in cache performance: the layout with the best cache locality and fewest misses produced the best performance. The only exception was the column-based format, which generated few LLC misses but still had the worst execution time, due to the effect of post-select data ordering.

Figure 4.3 shows the performance of different layouts when executing the queries in

experiment 2. As expected, the layout with all 32 selected attributes in one partition

performed the best. The more partitions the 32 selected fields were broken down into,

the worse the performance became, due to post-select record ordering. Figure 4.4 shows the cost of this step relative to the selection time at 50% selectivity. As expected, when all needed attributes were in one partition, no re-ordering was needed. When re-ordering was needed, it took a significant amount of time compared to the selection time. The column-based format performed the worst, due to both a longer selection time, caused by poorer cache performance, and a long post-select data ordering time.


4.1.3 Summary

From this case study, we conclude that it is important to keep the attributes that are accessed together by a query in one partition of their own, neither breaking them across partitions nor mixing other attributes into the same partition. This arrangement performs best due to better cache locality and lower post-select data ordering time.

4.2 Case Study 2: Where Attribute Alone vs. Together With Select Attributes

In this case study, we investigate different placements of the attribute used in the condition evaluation part of a query, namely the where clause. As mentioned in Section 3.3.3, a selection query with a where clause is executed by first checking the condition for each record in the partition that contains the where attribute. This is done by reading the where attribute of the record and comparing its value against the evaluation condition, usually by checking whether the value matches a specified string or falls within a numerical range. If the condition is met, attributes in the select clause are read, and the object ID is inserted into a temporary list for selecting attributes residing in other partitions, if any.

In this section, we investigate whether the where attribute should be placed in the same partition as the attributes in the select clause, or alone in a partition of its own. Given the way such queries are executed in our engine, as explained in Section 3.3.3, having the where field alone can lead to better or worse performance depending on the situation. Every element of the where attribute must be read and evaluated, but some elements of the select attributes may not need to be read, depending on the outcome of the condition evaluation. Having the where attribute alone allows the engine to scan through all the where values quickly, with better cache locality. This matters more when the selectivity is low, since the chance that the select attributes need to be read is then also low.

4.2.1 Methodology

The same dataset from case study 1 was used again in this case study. Two experiments

were performed. In the first experiment, we executed a series of queries with different

numbers of attributes in the select clause and with different selectivity values. For each

query, two layouts were tested: one with the select attributes and the where attribute in

the same partition, and one with the select attributes in one partition and the where

attribute alone in a separate partition.

The two experiments differed in that in experiment 1 the where attribute was not being

selected, while in experiment 2 the where attribute was also part of the select clause.

As before, each condition was run 10 times and the times were averaged with the largest

two values excluded. The percentage difference between having the where attribute alone

and having it together with the rest of the attributes was also reported for each

condition.


4.2.2 Results

Figure 4.5: Experimental result for experiment 1


Figure 4.6: Experimental result for experiment 2

The result of experiment 1, in which the attribute in the where clause is not part of the

select clause, is shown in Figure 4.5. Several observations can be made:

First, storing the where attribute together with the select attributes led to better

performance whenever the query selected fewer than 8 attributes, at all selectivity

levels, as shown by the left portion of the chart. This was a cache-size effect: with

fewer than 8 selected attributes, each an 8-byte long value, a record's attributes had

a combined size of less than 64 bytes and fit in a single L1 cache line. Therefore,

after reading the where attribute and evaluating the condition for a record, the data

to be selected for that record was very likely already in the cache, avoiding a second

memory access. Consequently, when the number of


attributes in the select clause is small, it is better to keep the attributes together in one

partition.
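The cache-line arithmetic behind this observation can be checked directly. The sketch below assumes 64-byte L1 cache lines and 8-byte long attributes, matching the figures quoted above; both constants are properties of the test machine, not of the engine itself.

```python
CACHE_LINE_BYTES = 64   # typical L1 cache line size (assumption)
LONG_BYTES = 8          # size of one long-integer attribute

def fits_in_one_cache_line(num_attributes):
    """Do this many 8-byte attributes of one record fit in a single cache line?"""
    return num_attributes * LONG_BYTES <= CACHE_LINE_BYTES

# 8 attributes occupy exactly 64 bytes; a 9th spills into a second line,
# forcing an extra memory access per record.
```

On machines with different line sizes the crossover point would shift accordingly.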

In contrast, when the query selected many attributes, having the where attribute

alone in a separate partition led to better performance. The effect was most significant

when the selectivity was low, i.e., 1%. At low selectivity, the chance of having to

access the attributes in the select clause was also low, so it was beneficial to keep the

where attribute alone, letting it be scanned very quickly with better cache locality.

The result of experiment 2, where the attribute in the where clause was also part of

the select clause, is shown in Figure 4.6. It shows a very different trend from the first

experiment. Apart from a few low-selectivity exceptions, having the where attribute

alone actually led to poorer performance, slowing execution by as much as 150%.

This was because the cost of post-select data ordering outweighed the benefit of

having the where attribute alone. The second chart in Figure 4.6 shows a breakdown

of select time and post-select record ordering time at 50% selectivity. It shows that

the post-select data ordering time was the source of the degraded performance when

the where attribute was in a separate partition.

4.2.3 Summary

To conclude this case study: if the attribute used in the condition evaluation part of

a query is not also being selected and the select clause of the query contains many

attributes, it is beneficial to place that attribute in a separate partition. Otherwise,

it is best to partition all attributes in the query, including the where attribute,

together.


4.3 Case Study 3: Sparse Attributes Separated and Fragmented vs. Together with Non-Sparse Attributes

As previously mentioned, one important characteristic of the JSON data model is data

sparseness, which arises when many attributes exist in only a small subset of records. In

this case study, we investigate the effect of data sparseness and try to gain some insight

into how to partition data with many sparse attributes. The NoBench micro-benchmark

described in Section 4.1 is suitable for this case study, as it contains a significant number

of sparse attributes.

Figure 4.7: Effects of vertical partitioning on null values in the data tables

Data sparseness is an important consideration in deciding the partitioning layout.

Consider the simple example illustrated in Figure 4.7: different layouts generate


different amounts of null values from the same dataset.

If there are many records in the table, the number of unnecessary null values

resulting from a poor layout can be substantial. A large number of null values in a

table degrades the cache locality of the non-null values that are actually needed, leading

to poorer performance. Moreover, during query execution, null values still need to be

read and compared before the engine can distinguish them from non-null values, which

takes additional time.

On one end of the spectrum, a row-based layout can produce a large number of

unnecessary null values, especially when the dataset has many sparse attributes.

For instance, in the first layout example in Figure 4.7, the row-based layout results in 6

null values. At the other extreme, a column-based layout reduces the number of

null values to zero, as shown by the third layout example in the same figure. However, as we

observed earlier, breaking attributes into very small partitions also leads to poorer

performance, due to the combined effect of post-select ordering and the need to jump to

different memory locations multiple times.
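The way a layout induces null padding can be made concrete with a small sketch. This is an illustrative model under our own assumptions, not the engine's storage code: a record occupies a row in a partition whenever it has at least one of that partition's attributes, and every attribute it lacks in such a partition becomes a null.

```python
# Illustrative model of null padding induced by a partition layout.
# All names and data below are hypothetical.

def count_nulls(records, layout):
    """records: list of sets of attribute names present in each record.
    layout: list of partitions, each a list of attribute names."""
    nulls = 0
    for attrs in records:
        for partition in layout:
            present = [a for a in partition if a in attrs]
            if present:  # record occupies a row in this partition
                nulls += len(partition) - len(present)
    return nulls

records = [{"a", "b"}, {"a", "c"}, {"a"}]
row_based = [["a", "b", "c"]]          # one wide partition: 4 nulls
column_based = [["a"], ["b"], ["c"]]   # one attribute per partition: 0 nulls
```

Between these extremes, grouping a few co-occurring sparse attributes per partition trades a small number of nulls for fewer partitions, which is the hybrid regime studied below.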

Therefore, we hypothesized that the hybrid paradigm of vertical partitioning could

perform better than either extreme. We performed a series of experiments to confirm

the effect of data sparseness in degrading database performance and, more importantly,

to gain some insight into how best to deal with sparse attributes.

4.3.1 Methodology

In this case study, 5 experiments were run with different subsets of the NoBench dataset.

Each experiment used a different dataset that included 4 non-sparse attributes and a

different number of sparse attributes, ranging from 4 sparse attributes in experiment 1

to 64 sparse attributes in experiment 5.

In each experiment, queries that selected all attributes considered in that experiment

were executed with different selectivity values. We tested the


performance of column-based and row-based formats, as well as different vertically

partitioned layouts. In the hybrid layouts, non-sparse attributes were kept in a separate

partition, and sparse attributes were segmented into partitions of various sizes, ranging

from 1 to the maximum number of sparse attributes in the experiment. The layouts

tested in the different experiments can be seen in the legend of Figure 4.8.

As before, each query was run 10 times for each condition, and the execution times

were averaged after filtering out the two highest values.

4.3.2 Results

The experimental results for this case study are shown in Figure 4.8. Several observations

can be made:

When the number of sparse attributes in the dataset was small, i.e., fewer than 16,

as shown by the first three charts in Figure 4.8, the impact of null values in the row-

based format was low enough that row-based performed the best. The column-based

format had the worst performance because separating co-accessed data into many

different partitions forced post-select record ordering. Hybrid layouts performed

somewhere between the two extremes. Therefore, when the dataset has only a small

number of sparse attributes, it is best to put all accessed attributes into one partition,

even though this introduces some null values.

However, as the number of sparse attributes in the dataset increased past a certain

level, e.g., 32, the benefit of the hybrid approach became obvious. As shown by the

last two charts in Figure 4.8, some of the hybrid layouts outperformed row-based, and

the advantage of row-based over column-based started to narrow. More importantly, how

the sparse attributes were partitioned also mattered: there was an optimal partition size

for sparse attributes that led to the best performance. The reason is that a large

partition size causes more null values, while a partition size that is too small would


lead to excessive segmentation of the data and, in turn, increased post-select ordering time.

With these in mind, we also performed an experiment using the entire NoBench

dataset, which includes 1000 sparse attributes and 19 non-sparse attributes. We

tested the extreme case of a "SELECT * WHERE xxx" query, where all attributes

were selected when the condition was met. Results are shown in Figure 4.9. The

experimental results showed that:

1. Separating sparse attributes from non-sparse attributes led to better performance.

Non-sparse attributes appear in every record in the dataset, and when a record

was selected they all needed to be accessed. Therefore it was best to keep them

in one partition, achieving better locality and avoiding the negative impact of

null values.

2. There’s an optimal partition size except when the selectivity was really low, e.g.,

1%. Figure 4.10 shows the number of null values in the entire table as a function

of partition size. Layout with partition size of 1 had no null values, but as the

partition size increased, the number of null values also increased dramatically with

a rate similar to that observed in Figure 4.9. The increase in the number of null

values was the driving force behind the rapid increase in execution time when the

partition size exceeded a certain threshold. At the opposite extreme, if the partition

size was too small, the cost of having too many partitions outweighed the benefit

of having little null values.

4.3.3 Summary

In summary, we found that it is best to separate sparse and non-sparse attributes,

and to segment the sparse attributes into partitions of smaller size. A partition size

that is too large leads to too many null values, degrading cache locality. In contrast,

a layout that is too fine-grained incurs extensive post-select record ordering time that


outweighs the benefit of having few null values. In the case of the NoBench dataset,

the optimal partition size for the sparse attributes is around 5 to 10.

4.4 Guidelines in Partitioning

In summary, three case studies were performed to derive guidelines on how to better

partition data in different scenarios. These guidelines were used in the design of the

partitioning algorithm described in Section 3.5. Through these experiments, we discovered

that:

1. Given a query, it is best to isolate all of the attributes that it selects into a partition,

without breaking them apart or introducing other attributes that are not accessed

into the partition.

2. When the select clause of a query contains a large number of attributes and does

not include the attribute used by the where clause for condition evaluation, the

where attribute should be put into a separate partition. Otherwise, it should be

placed in the same partition as the attributes in the select clause.

3. Data sparseness is an important factor in data partitioning. Sparse attributes

should be isolated from non-sparse attributes. Furthermore, sparse attributes in

the dataset should be segmented into multiple partitions with small partition size.

Specifically, based on our profiling, the optimal size for the NoBench dataset is 5

to 10 attributes per partition.
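The three guidelines above could be encoded as a simple heuristic. The sketch below is a hypothetical illustration, not the thesis's actual partitioning algorithm; the threshold for "a large number of attributes" and the sparse group size are illustrative assumptions drawn from the case-study observations.

```python
# Hypothetical heuristic encoding the three partitioning guidelines.
# Threshold values are assumptions for illustration only.

def layout_for_query(select_attrs, where_attr, many_select_threshold=8):
    """Guidelines 1 and 2: keep selected attributes together; separate the
    where attribute only when many attributes are selected and the where
    attribute is not among them."""
    if where_attr not in select_attrs and len(select_attrs) >= many_select_threshold:
        return [sorted(select_attrs), [where_attr]]
    return [sorted(set(select_attrs) | {where_attr})]

def group_sparse(sparse_attrs, group_size=5):
    """Guideline 3: segment sparse attributes into small partitions."""
    return [sparse_attrs[i:i + group_size]
            for i in range(0, len(sparse_attrs), group_size)]
```

The actual partitioner (Section 3.5) must additionally arbitrate between conflicting queries, which is the subject of the next chapter.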


Figure 4.8: Comparison between row-based, column-based and different vertically partitioned layouts on datasets with different amounts of sparse data


Figure 4.9: Comparison between row-based, column-based and different vertically partitioned layouts when executing the query "SELECT * WHERE something" using the entire NoBench dataset with 1000 sparse attributes


Figure 4.10: Number of null values as a function of partition size when storing the NoBench dataset


Chapter 5

Dynamic Vertical Partitioning in Multi-Type Workloads

In Chapter 4, several partitioning guidelines were developed based on our case studies

and profiling. Those case studies investigated the outcomes of different partitioning

practices for a particular query, and they were useful in giving us insight and

partitioning guidelines. In a realistic workload, however, users do not execute just one

query; multiple queries of different types are executed over a period of time. In this

project, a workload is defined as a series of different queries to be executed by the

engine one by one. Our goal is to find the storage method that takes the least time to

execute all queries in the workload sequentially.

One problem that arose was that, in a workload, the attributes accessed by different

queries can overlap. For example, a subset of queries in the workload might access

attributes A and B while other queries access attributes B and C together, making

attribute B a hot spot. The best storage layout might differ between the two cases.

However, over the execution of a workload, it is impossible to build and load data

into an entirely different layout every time a different query is executed.



Therefore, we developed a partitioning algorithm that accepts workload and data

characteristics as input and generates the vertical partitioning layout expected to give

the best overall performance for the workload.

In this chapter, we evaluate our system’s performance in workloads with multiple

conflicting query types.

5.1 Evaluation Methodology

In this section, we describe the methodology we used to evaluate our hybrid partitioning

storage in workloads with multiple query types. Our workloads should have the following

characteristics:

1. It should contain a large list of queries of multiple query types, each with a

different frequency of occurrence, mimicking real-life situations where many queries

are queued up to be executed as a batch over a period of time. In the original

NoBench micro-benchmark and its experiments, queries are tested in isolation from

each other and only the average time of each query is reported. In reality, however,

an application runs different query types, with different frequencies, over a period

of time.

2. The different query types needed to conflict with each other to some degree; for

example, multiple query types might try to access the same "hot" attribute. We

wanted to test whether the partitioning algorithm is intelligent enough to produce

a partitioning layout that leads to optimal overall performance.

3. In real-life situations, there is an infinite number of ways in which different

queries in a workload can conflict with each other. For example, the same set of

queries can have different frequencies and selectivities at different times, because

users' querying behaviours and the underlying data characteristics can


change. The level of conflict among queries can also differ from one period to

another. Therefore, our evaluation methodology should cover many different

possibilities.

For these reasons, we decided to perform a large number of randomized experiments,

covering different possibilities, to see how our system performs in different scenarios

without manual bias. In each experiment, a different workload is generated based on

several sets of workload templates and two randomized parameters: the selectivity and

the frequency of each query type in the workload templates.

Figure 5.1: Our methodology for generating 150 independent randomized workloads.

For subsequent experiments in this chapter, a workload is generated for each experi-

ment using a procedure summarised in Figure 5.1. In a nutshell, we first came up with

three sets of workload templates with different levels of conflict among queries, as shown


in the top third of the figure. We chose three different sets of workload templates in

order to test situations with various levels of conflict among queries. To generate the

workload for one experiment, we first picked one of the three workload templates. For

each query in the template, a frequency and a selectivity value were randomly

generated, as depicted in the middle part of Figure 5.1. Lastly, we generated a workload

of the desired size by duplicating each query type according to its frequency, in random

order.

This process was repeated many times, generating a different workload each time so

that we could test different situations. In total, we performed 150 of these randomized

experiments: 50 for each of the three workload templates of different conflict levels.
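The generation procedure above can be sketched in a few lines. This is an illustrative sketch under stated assumptions, not the thesis's harness: the template contents, the frequency range, and the set of selectivity values drawn from are all hypothetical placeholders.

```python
import random

# Illustrative sketch of the randomized workload-generation procedure:
# draw a random frequency and selectivity per query type in a template,
# then build a workload of the desired size in random order.
# Parameter ranges and template contents are hypothetical.

def generate_workload(template, size, rng=random):
    weighted = []
    for query_type in template:
        frequency = rng.randint(1, 10)               # random relative frequency
        selectivity = rng.choice([0.01, 0.1, 0.5])   # random selectivity
        weighted += [(query_type, selectivity)] * frequency
    # Sample with replacement, proportionally to frequency, to reach the
    # desired workload size in random order.
    return [rng.choice(weighted) for _ in range(size)]

workload = generate_workload(["Q1", "Q2", "Q3"], size=100)
```

Repeating this with a fresh random draw per experiment yields the 150 independent workloads.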

We performed each experiment by executing the queries in the workload sequentially

and measuring the time it took to finish all queries in the workload, using the vertical

partitioning storage with layouts generated by the partitioner with alpha values of 0.1,

0.3, 0.5, 0.7, and 0.9, as well as the traditional columnar and row-based methods. In

addition, we also tested each workload on a manually generated vertically partitioned

layout, which we call "improved row-based": all non-sparse attributes in a row-based

layout and all sparse attributes in groups of 10. This layout is a manual improvement

over the traditional row-based method, which, as we observe later, has very poor

performance due to its large number of null values. The partitioning algorithm was

expected to be intelligent enough to generate, under different situations, layouts that

beat the manually generated improved row-based layout.

These 150 experiments were performed in isolation from each other, which allows us

to see how the hybrid partitioning storage and the partitioning algorithm perform in

different isolated situations. However, another key aspect that differentiates our system

from traditional storage methods and other vertical partitioning storage methods, such

as the Hyrise system [6], is the adaptation of the storage layout in response to changes in


workloads. In real-life situations, workload characteristics are subject to change,

induced by changes in data characteristics and users' querying behaviours. For example,

the conflict level of queries in a workload might increase dramatically over a time

period because some real-world event causes more users to become interested in the

same set of attributes. For this reason, we also performed an experiment where three

different workloads were combined into one larger workload by appending them one

after another. The resulting workload contained two points in time where the workload

characteristics suddenly changed. We then manually triggered re-partitioning at those

time points. To avoid pausing query processing while waiting for re-partitioning to

finish, the re-partitioning process, once triggered, is executed in a parallel thread. The

process triggers the partitioning algorithm to generate a new layout for the new

workload characteristics, and the system then constructs new tables based on the new

layout if it differs from the old one. Once the re-partitioning is done, it signals the

storage engine to switch to the newly built tables and remove references to the old

tables.
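The background re-partitioning flow just described can be sketched as follows. This is a hypothetical sketch, not the engine's code: queries keep reading the old tables while a worker thread builds tables for the new layout, and a short critical section swaps the table reference once the build completes.

```python
import threading

# Hypothetical sketch of background re-partitioning with a table swap.
# The engine's actual synchronization is assumed, not reproduced.

class Storage:
    def __init__(self, tables):
        self.tables = tables
        self._lock = threading.Lock()

    def repartition_async(self, build_new_tables):
        def worker():
            new_tables = build_new_tables()   # may take a long time
            with self._lock:
                self.tables = new_tables      # switch; old tables become garbage
        t = threading.Thread(target=worker)
        t.start()
        return t

storage = Storage(tables={"layout": "old"})
t = storage.repartition_async(lambda: {"layout": "new"})
t.join()  # in the engine, queries would keep running until the swap completes
```

Only the final reference swap is serialized, so query execution is never blocked by the table build itself.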

To summarise our evaluation methodology: we compared the traditional storage methods

against the hybrid partitioning storage, with layouts generated by the partitioning

algorithm using different alpha values, across 150 experiment runs, each with different

workload characteristics generated by randomizing two parameters. We then performed

an experiment in which the workload characteristics changed twice during a workload,

to see how our system adapts to changes.

5.2 Results

As mentioned, we performed a set of 150 experiments to evaluate vertical partitioning

on different workloads. In this section, we first show some statistics over the 150

experiments as a high-level overview. We then study one of the experiments to gain deeper insights


Figure 5.2: Average improvement over traditional storage formats using different alpha values

into the behaviour of the partitioner. We conclude this section with the result of the

adaptation experiment.

5.2.1 Sensitivity to Alpha Value in the Partitioning Algorithm

Figure 5.2 and Figure 5.3 summarise the results of the 150 experiments. Figure 5.2

shows the average improvement of vertical partitioning, using different alpha values for

the partitioning algorithm, over the best of column-based, row-based, and improved row-

based. It shows that the alpha value is an important parameter and that, under

different conflict levels, different alpha values lead to different results. For example, an

alpha value of 0.1 generated on average nearly 35% improvement over the best of the

traditional storage methods under workloads with little to no conflict among queries,

but suffered on high-conflict workloads, performing 8% worse than the traditional

methods. Higher alpha values, such as 0.7 and 0.9, produced much more consistent

improvement across workloads with different conflict levels: they achieved about 20%

improvement on average, despite not doing as well as lower alpha values on workloads

without conflict.


Figure 5.3: Percentage of time when vertically partitioned layouts outperform traditional formats

Figure 5.3 shows, for each conflict level, the percentage of instances in which the

vertically partitioned layouts produced by different alpha values outperformed the best

of column-based, row-based, and improved row-based. The two figures differ in that

Figure 5.3 shows how often vertical partitioning outperformed the traditional methods,

while Figure 5.2 shows by how much it outperformed them. Figure 5.3 shows that,

similar to what was observed in Figure 5.2, lower alpha values performed well only when

there was no conflict among queries. At low conflict levels, shown in the middle of the

figure, alpha 0.1 performed worse than the best of the traditional methods nearly 50%

of the time. The problem worsened as the conflict level increased, such that alpha 0.1

outperformed the traditional methods less than 30% of the time. In contrast, higher

alpha values, such as 0.7 and 0.9, consistently led to better performance regardless of

the conflict level among queries, outperforming the best of the traditional methods over

90% of the time.

To sum up this part of our discussion: when choosing the alpha value for the

partitioner, our results suggest that higher alpha values consistently achieve better

performance, both against the traditional storage methods and against lower alpha values.


Next, let us examine one of the 150 experiments in greater detail.

5.2.2 Detailed Study of One Experiment Out of 150

Figures 5.4 and 5.5 show one of the 150 experiments in greater detail. The top chart in

Figure 5.4 compares the total runtime of the workload under different storage methods.

The second chart in the figure shows the average time each storage method spent on

each type of query in the workload, with the y-axis on a logarithmic scale. Figure 5.5

shows the three different partitioning layouts generated by different alpha values.

Several observations can be made from Figures 5.4 and 5.5.

First of all, this workload was generated using the high-conflict workload template, so

there was a great deal of overlap among the different query types. The partitioning

algorithm therefore sometimes had to sacrifice queries with less weight for the benefit

of the overall runtime. As the runtime breakdown in Figure 5.4 shows, none of the

hybrid partitioning layouts achieved better performance than the traditional methods

on every query, yet they all beat the best of the traditional methods in the overall

execution time of the entire workload. Among the traditional storage methods, the

improved row-based storage performed the best. Vertical partitioning outperformed

improved row-based by at least 23% and up to 38%, in the case of alpha 0.1 and 0.9,

respectively.

The execution time breakdown in Figure 5.4 reveals some details of the partitioning

outcomes and their effects on the execution time of each query type. For example,

query types 1, 2, and 3 all accessed the attribute "str1", making it a hot spot. With

alpha 0.1, the partitioning algorithm put "str1" alone in a separate partition, which

favoured query type 1 and led to better performance there, but poorer performance for

query type 3. In contrast, alpha 0.9 put "str1" with the other attributes accessed

together in query type 3, which led to poorer performance for query type 1 but a good

outcome for query type 3. However, since query type 3 had much higher execution time

and frequency in the


Figure 5.4: Total runtime and runtime breakdown for experiment 1 results


Figure 5.5: Layouts generated using different alpha values in experiment 1

workload, a partitioning layout that favoured query type 3 was more beneficial overall.

Similarly, column-based worked well for query types that selected a small number of

attributes, such as query type 1, while improved row-based performed well for query

types that accessed more attributes, such as query types 5 and 7. Overall, though,

because these traditional storage methods lack the flexibility to tailor the data storage

pattern to the access patterns of the workload, they ultimately failed to match the

performance of the vertical partitioning layouts in terms of overall execution time.

5.2.3 Evaluation of Dynamic Adaptation

As mentioned previously, we then performed an experiment in which the workload

characteristics changed twice during a larger workload. The result of this experiment is

shown in Figure 5.6. The y-axis shows the running average of the runtimes of the most

recent 500 queries. Workload characteristics changed at times A and C; re-partitioning

started at those same times and was processed in a parallel thread, finishing at times B

and D. Between the start and end of the re-partitioning process, query execution

continued with the old partitioning layout while


Figure 5.6: Workload adaptation of dynamic vertical partitioning

new tables were being built. The effect is most noticeable at times C and D. At time C the workload characteristics changed, but since the layout optimized for the previous workload was still in use, the execution performance of the hybrid partitioning storage took a hit. After time D, when the dynamic vertical partitioning storage switched to the new layout, performance improved, as shown by the drop in runtime starting at time D.

Over the entire experiment, thanks to the partitioning algorithm's ability to pick an optimal layout for different workload characteristics and to the re-partitioning process, the variation in runtime is much smaller than with the traditional approaches. The system consistently outperformed the others despite changes in workload characteristics, which is an important feature in real-life applications, where real-world events can induce changes in query and data characteristics.
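The behaviour between times A/B and C/D, with queries continuing on the old layout while a new one is built in a parallel thread and then switched in, can be sketched as follows (an illustrative Python sketch, not our actual implementation):

```python
import threading

class PartitionedTable:
    """Illustrative sketch (not the thesis code): readers keep using the old
    layout while a background thread builds the new one; the reference is then
    swapped atomically, mirroring the switches at times B and D."""

    def __init__(self, layout):
        self._layout = layout          # the currently active partitioning layout
        self._lock = threading.Lock()

    def execute(self, query):
        # Readers always see a complete layout, either the old or the new one.
        with self._lock:
            layout = self._layout
        return f"ran {query} on {layout}"

    def repartition(self, new_layout, build):
        def worker():
            built = build(new_layout)  # expensive rebuild, off the query path
            with self._lock:
                self._layout = built   # atomic switch to the new layout
        t = threading.Thread(target=worker)
        t.start()
        return t                       # caller may join or let it run

table = PartitionedTable("row-major")
t = table.repartition("hybrid-v2", build=lambda name: name)
t.join()
assert table.execute("q1") == "ran q1 on hybrid-v2"
```

In the real system the build step copies the data into the new column groups, which is why queries between A and B (or C and D) still pay the cost of the stale layout.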

Chapter 6

Related Work

Work related to this thesis falls into two areas: 1) in-memory relational databases in general, and 2) in-memory databases that also vertically partition data to optimize performance.

6.1 In-memory Relational Databases

A number of systems have been designed for real-time analytics. As mentioned before, Scuba is a fast, distributed, in-memory database developed by Facebook for most of its real-time analysis [1]. Data are ingested from various events occurring in Facebook's code base through Scribe, a distributed messaging system for collecting, aggregating, and delivering large volumes of log data with low latency. As data arrive, they are stored in the main memory of two leaf nodes in the cluster. The schema at each node is created automatically when the node first receives data. Data are deleted at the same rate that new data arrive, since most applications need only the most recent data. Scuba supports simple selection and aggregation queries, as we do, but not join queries, under the assumption that joins are usually performed to combine data from multiple sources before they are ingested into Scuba. While we still lack some features that Scuba has, such as data aging and fast data ingestion, the Scuba system

is most similar to what we aim to achieve in this project. The major difference is that Scuba employs a purely row-based layout for its data in main memory, and as we have seen, a pure row-based layout is not the best strategy in some scenarios. We believe that with our hybrid strategy, coupled with future plans to implement the additional features mentioned above, our system will offer better performance and the flexibility to adapt to different workloads than the Scuba system. In fact, the Scuba publication itself identifies investigating column-based storage, instead of the current row-based store, as a major future direction. However, to the best of our knowledge, no follow-up work to the original Scuba paper has been published.

H-Store is another high-performance, distributed main-memory database management system, developed several years ago and later built upon by others to create S-Store, a data management system for stream processing. However, both systems are designed for OLTP applications, which makes them less suitable for OLAP and mixed OLTP/OLAP workloads [8, 2].

HyPer, developed in 2011, is a hybrid OLTP/OLAP main-memory database system based on virtual memory snapshots [9]. However, it is unclear how it could adapt to changes in the workload, or how it would handle semi-structured data.

SAP HANA is another in-memory database in this space, designed for both analytical and transactional workloads [5]. HANA optimizes memory usage through data compression and by changing the data storage method for different applications, i.e., it switches between a column store and a row store based on data access patterns. However, it differs from our work in that it offers only the two extremes for a given workload, not a hybrid approach that could further optimize performance. HANA relies on system administrators to specify at definition time whether a row-based or columnar format should be used.

PowerDrill is another data management system designed by Google for analytics, and is reported to have very fast query execution times compared to other analytics systems developed by Google [7]. However, like many other systems, it only uses column-wise

storage, making it less suitable for some workloads. Similarly, Shark, which uses coarse-grained distributed memory for fast data analysis over Hive with an emphasis on result completeness and failure recovery, uses only a column-based approach [4].

6.2 In-memory Relational Databases with Vertical Partitioning

To the best of our knowledge, the only in-memory database study that has also considered the use of a hybrid data store format is Hyrise [6]. Hyrise developed a detailed cache model and an algorithm to find the best vertical partitioning layout based on that model. The rationale is that for an in-memory database, cache performance is the dominant factor affecting query execution performance, so basing the hybrid partitioning decision on cache modelling should yield the best layout with high confidence. Our work differs in that, instead of basing the search for a partitioning layout on a mathematical model of cache misses, we use the basic guidelines obtained through our study to limit the search space, making our partitioner much less time-consuming. This is especially important for semi-structured data with many sparse attributes. As mentioned before, we tried to run Hyrise on our JSON dataset, the NoBench dataset, and the program ran for so long without finishing that we had to terminate it.

Chapter 7

Conclusion and Future Work

In this thesis, we developed a simple in-memory database system for real-time analytics on semi-structured data. While the system still lacks features available in commercial enterprise systems, it allowed us to investigate the impact of data layout on the query performance of semi-structured data stored in an in-memory database.

Using our prototype implementation, we conducted three in-depth case studies to investigate the impact of different partitioning practices and to derive guidelines on how best to partition JSON data vertically under different query workloads. We showed that while the columnar and row-based approaches perform well in their respective paradigms, our vertical partitioning approach can achieve the best of both worlds and can significantly improve query execution performance over the traditional columnar and row-based approaches.

We further used these partitioning guidelines in the design of a partitioning algorithm that intelligently produces an optimal partitioning layout given the characteristics of a dataset and query workload. Through repeated randomized experiments, we showed that our system consistently outperforms the traditional methods.
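As a rough illustration of the guideline-driven idea (this is a toy sketch, not our actual partitioning algorithm), attributes can be grouped by the set of queries that access them, so that attributes which always appear together land in the same vertical partition:

```python
from collections import defaultdict

def partition_by_coaccess(queries):
    """Toy sketch of guideline-based vertical partitioning (illustrative only):
    each query is a set of accessed attribute names; attributes with identical
    access signatures are placed in the same column group."""
    signature = defaultdict(set)
    for i, q in enumerate(queries):
        for attr in q:
            signature[attr].add(i)   # record which queries touch this attribute
    groups = defaultdict(list)
    for attr, sig in signature.items():
        groups[frozenset(sig)].append(attr)
    return [sorted(g) for g in groups.values()]

# "user" and "ts" are always read together; "body" is read separately,
# so it gets its own partition and narrow scans never drag it along.
workload = [{"user", "ts"}, {"user", "ts"}, {"body"}]
layout = partition_by_coaccess(workload)
assert sorted(layout) == [["body"], ["ts", "user"]]
```

A real partitioner must also weigh partition widths, sparseness, and reconstruction cost, which is where the guidelines from our case studies come in.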

More importantly, a key novelty of our system, compared to other projects

that also vertically partition datasets in main memory, is its ability to change its partitioning layout incrementally to adapt to changes in data and workload characteristics.

In summary, we made the following contributions in this dissertation:

1. Designed and implemented an in-memory database system that provides future

researchers with a tool to further study different aspects of real-time data analytics

on semi-structured data.

2. Conducted in-depth studies and identified guidelines on vertical partitioning in different situations, which were used by other researchers to develop an intelligent partitioning algorithm.

3. Evaluated the performance of the system, including the partitioning algorithm, and showed its effectiveness under different workloads.

Going forward, our future work will focus on the following areas:

1. We will finish implementing and incorporating features such as periodic data aging, different compression policies, and different indexing methods. These features were not the focus of this thesis, but are important for further improving the performance of our in-memory database system. These implementations will take priority over other features such as scalability, fault recovery, and availability of data.

2. The re-partitioning process was triggered manually in our adaptation experiment, since we knew at which points in time the workload characteristics started to change. In reality, however, applications might not have that knowledge in advance. As part of our future work, we will implement automatic re-partitioning based on on-the-fly profiling.

3. We will investigate better ways to perform experiments on mixed workloads with semi-structured data. Currently we use the NoBench dataset introduced by Argo [3]. However, NoBench does not provide a real-life dataset and workload, and might not be representative enough of real-life applications. For example, in the NoBench dataset all sparse attributes have the same sparseness ratio across the dataset, which is unlikely in practice: there might not be a clear-cut distinction between sparse and non-sparse attributes, and some attributes would fall somewhere between the two extremes. In the future, we will explore other, more realistic data and workloads as they become available to researchers.
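The profiling-based trigger mentioned in future-work item 2 above could, for instance, compare attribute-access histograms over successive query windows and flag a re-partition when the distribution drifts. The sketch below is a hypothetical illustration; the class name, window size, and threshold are all assumptions, not part of the implemented system:

```python
from collections import Counter

class WorkloadProfiler:
    """Hedged sketch of on-the-fly profiling: maintain an attribute-access
    histogram per window of queries and signal re-partitioning when the L1
    distance between consecutive histograms exceeds a threshold."""

    def __init__(self, window=500, threshold=0.3):
        self.window = window        # queries per profiling window
        self.threshold = threshold  # drift level that triggers re-partitioning
        self.baseline = None        # histogram of the previous window
        self.recent = []            # queries in the current window

    def _histogram(self, queries):
        counts = Counter(a for q in queries for a in q)
        total = sum(counts.values())
        return {a: c / total for a, c in counts.items()}

    def observe(self, query_attrs):
        """Record one query's accessed attributes; True means re-partition."""
        self.recent.append(query_attrs)
        if len(self.recent) < self.window:
            return False
        hist = self._histogram(self.recent)
        self.recent = []
        if self.baseline is None:
            self.baseline = hist    # first window just establishes a baseline
            return False
        # L1 distance between the old and new access distributions.
        attrs = set(self.baseline) | set(hist)
        drift = sum(abs(self.baseline.get(a, 0) - hist.get(a, 0)) for a in attrs)
        self.baseline = hist
        return drift > self.threshold

profiler = WorkloadProfiler(window=2)
assert profiler.observe({"a"}) is False
assert profiler.observe({"a"}) is False      # baseline window: all queries hit "a"
profiler.observe({"b"})
assert profiler.observe({"b"}) is True       # access shifted to "b": trigger
```

In a deployed system the trigger would hand the new histogram to the partitioner, which rebuilds the layout in a background thread as in our adaptation experiment.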

Bibliography

[1] Lior Abraham, John Allen, Oleksandr Barykin, Vinayak Borkar, Bhuwan Chopra, Ciprian Gerea, Daniel Merl, Josh Metzler, David Reiss, Subbu Subramanian, Janet L. Wiener, and Okay Zed. Scuba: Diving into data at Facebook. Proc. VLDB Endow., 6(11):1057–1067, August 2013.

[2] Ugur Cetintemel, Jiang Du, Tim Kraska, Samuel Madden, David Maier, John Mee-

han, Andrew Pavlo, Michael Stonebraker, Erik Sutherland, Nesime Tatbul, Kristin

Tufte, Hao Wang, and Stanley Zdonik. S-store: A streaming newsql system for big

velocity applications. Proc. VLDB Endow., 7(13):1633–1636, August 2014.

[3] Craig Chasseur, Yinan Li, and Jignesh M. Patel. Enabling JSON document stores in relational systems, 2013.

[4] Cliff Engle, Antonio Lupher, Reynold Xin, Matei Zaharia, Michael J. Franklin, Scott

Shenker, and Ion Stoica. Shark: Fast data analysis using coarse-grained distributed

memory. In Proceedings of the 2012 ACM SIGMOD International Conference on

Management of Data, SIGMOD ’12, pages 689–692, New York, NY, USA, 2012.

ACM.

[5] Franz Färber, Sang Kyun Cha, Jürgen Primsch, Christof Bornhövd, Stefan Sigg, and Wolfgang Lehner. SAP HANA database: Data management for modern business applications. SIGMOD Rec., 40(4):45–51, January 2012.

[6] Martin Grund, Jens Kruger, Hasso Plattner, Alexander Zeier, Philippe Cudre-

Mauroux, and Samuel Madden. Hyrise: A main memory hybrid storage engine.

Proc. VLDB Endow., 4(2):105–116, November 2010.

[7] Alex Hall, Olaf Bachmann, Robert Buessow, Silviu-Ionut Ganceanu, and Marc

Nunkesser. Processing a trillion cells per mouse click. PVLDB, 5:1436–1446, 2012.

[8] Robert Kallman, Hideaki Kimura, Jonathan Natkins, Andrew Pavlo, Alex Rasin,

Stanley B. Zdonik, Evan P. C. Jones, Samuel Madden, Michael Stonebraker, Yang

Zhang, John Hugg, and Daniel J. Abadi. H-store: a high-performance, distributed

main memory transaction processing system. PVLDB, 1(2):1496–1499, 2008.

[9] A. Kemper and T. Neumann. HyPer: A hybrid OLTP & OLAP main memory database system based on virtual memory snapshots. In Data Engineering (ICDE), 2011 IEEE 27th International Conference on, pages 195–206, April 2011.

[10] Marcel Kornacker and Justin Erickson. Cloudera impala: Real-time queries in

apache hadoop, for real, October 2012.

[11] Jens Krueger, Christian Tinnefeld, Martin Grund, Alexander Zeier, and Hasso Plat-

tner. A case for online mixed workload processing. In Proceedings of the Third

International Workshop on Testing Database Systems, DBTest ’10, pages 8:1–8:6,

New York, NY, USA, 2010. ACM.

[12] Peter Lake and Paul Crowther. Concise Guide to Databases. Springer-Verlag Lon-

don, 1 edition, 2013.

[13] Sergey Melnik, Andrey Gubarev, Jing Jing Long, Geoffrey Romer, Shiva Shivaku-

mar, Matt Tolton, and Theo Vassilakis. Dremel: Interactive analysis of web-scale

datasets. In Proc. of the 36th Int’l Conf on Very Large Data Bases, pages 330–339,

2010.

[14] Oracle. Oracle database in-memory, October 2014.

[15] Hasso Plattner. A common database approach for oltp and olap using an in-memory

column database. In Proceedings of the 2009 ACM SIGMOD International Confer-

ence on Management of Data, SIGMOD ’09, pages 1–2, New York, NY, USA, 2009.

ACM.

[16] Hasso Plattner and Alexander Zeier. In-Memory Data Management: Technology

and Applications. Springer-Verlag Berlin Heidelberg, 2nd edition, 2012.

[17] Maren Steinkamp and Tobias Mühlbauer. HyDash: A dashboard for real-time business intelligence based on the HyPer main memory database system.

[18] T. Suzumura and T. Oiki. Streamweb: Real-time web monitoring with stream

computing. In Web Services (ICWS), 2011 IEEE International Conference on, pages

620–627, July 2011.

[19] E. Thomsen. OLAP Solutions: Building Multidimensional Information. Wiley, New

York, 2002.

[20] H. Zhang, G. Chen, B. Ooi, K. Tan, and M. Zhang. In-memory big data management

and processing: A survey. Knowledge and Data Engineering, IEEE Transactions on,

PP(99):1–1, 2015.