data processing in cassandra vs mysql: a comparative ...iahpc.ir/paperspdf/85.pdf · the cassandra...


Data processing in Cassandra vs MySQL: A comparative analysis in the query performance

Seyyed Ali Hosseini1, Fereshteh-Azadi Parand1, Farzam Matinfar1

1 Allameh Tabataba'i University, Tehran, Iran

{ ali_hosseini, parand, f.matinfar}@atu.ac.ir

Abstract. Today, data is generated in massive volumes, at extreme speed, and in many varieties. Besides the growth of the Internet and social networks, electronic devices, sensors, and even turbines generate several gigabytes of data each day. In most applications, relational database management systems (RDBMS) are responsible for storing and managing these data, but they cannot handle this volume efficiently because of architectural limitations. Other databases are therefore needed to store massive data and provide quick access to it. In the last few years, another class of database management systems, called NoSQL, has grown in popularity for managing huge amounts of data. This class of databases can handle large amounts of data and speed up access to it. In this paper, we describe NoSQL databases, explain the structure of Cassandra, a well-known NoSQL database, and finally compare its performance with the MySQL relational database and present experimental results.

Keywords: NoSQL, RDBMS, Comparison, Cassandra, Big Data, Column Family, Distributed Database.

1 Introduction

In 1970, Dr. Edgar F. Codd, then a staff member at IBM Research, introduced the theory of relational data modeling in an article entitled A Relational Model of Data for Large Shared Data Banks [1]. This article became the basis of the Relational Database Management System (RDBMS). Relational databases are still among the most popular applications in computer history. Today, however, the exponential growth of data has given rise to a new domain named Big Data. Big Data refers to data that is produced at high speed, comes in different varieties (structured, semi-structured, and unstructured), and reaches volumes of several terabytes or petabytes. Limitations in architecture, data model, and scalability make relational databases unable to support the rapid growth and variety of data. These limitations have led to the creation of other platforms for Big Data.

NoSQL databases are well known in Big Data and Internet of Things applications. NoSQL stands for Not Only SQL, a term applied to a class of database management systems. Unlike relational databases, these databases do not require a tabular structure for storing data. They can manage thousands of terabytes of data and provide faster access to massive data. These databases are distributed. Two characteristic features of this class of databases are replication and partitioning. Replication enables developers to keep copies of their data on multiple servers and continue to access the data if a server fails. Partitioning makes it easy to distribute data across multiple servers in the cluster.

NoSQL databases are divided into four classes based on the storage method:

Column-based. Such as HBase and Cassandra.

Key-value. Such as Redis.

Document-oriented. Such as MongoDB.

Graph-oriented. Such as Neo4j.

In this paper, we describe the structure and examine the performance of Cassandra,

a column-based NoSQL database, and compare it with a relational database called

MySQL. MySQL is an open source database developed and supported by Oracle.

The rest of the paper is organized as follows. Section 2 reviews related work. Section 3 introduces the Cassandra data model and structure, and the query language of this database, called CQL. Section 4 discusses distribution in Cassandra, and Section 5 describes its read and write operations. Section 6 describes the experimental environment and the data used to compare the two databases, and presents the experimental results of running the queries in both. Finally, Section 7 concludes the paper and outlines future work.

2 Related Work

NoSQL databases are next-generation databases that mostly address some of the following points: being non-relational, distributed, open-source, and horizontally scalable. The original intention was modern web-scale databases. The movement began in early 2009 and is growing rapidly. Today, there are 18 free and widely used open-source NoSQL databases. Han et al. classify NoSQL databases according to the CAP theorem and describe the background, basic characteristics, and data model of NoSQL [2]. Makris et al. [3] and Atzeni et al. [4] analyze some NoSQL databases and describe a classification of NoSQL data stores based on key design characteristics.

Some studies compare NoSQL databases with other database management systems, especially relational databases. Kabakus and Kara [5] evaluate the performance of in-memory databases. The NoSQL databases used in their experiment are MongoDB, Cassandra, Redis, and Memcached, and the relational database is H2. Unlike most relational databases, H2 can store data in memory. They compare the performance of the databases through four experiments: (1) time to write key-value pairs, (2) time to read the value corresponding to a given key, (3) time to remove the key-value pair corresponding to a given key, and (4) time to get all the data. They conclude that Memcached clearly provides the best write performance. Redis uses memory more efficiently than the others and also provides fairly better performance for the read and delete operations. MongoDB provides by far the best performance for fetching the whole data set. Van der Veen et al. [6] compare the performance of three databases: PostgreSQL, MongoDB, and Cassandra. They simulate sending data from several sensors to compare database performance. Lee et al. [7] compare the performance of NoSQL and XML databases for storing and retrieving clinical data.


In this paper, we aim to become deeply familiar with the structure and data model of the Cassandra database and to explain its advantages and disadvantages. In an experiment, we compare its performance with that of the MySQL database. We report and analyze the results of this experiment and describe the conditions under which Cassandra performs better than MySQL on one machine.

3 Cassandra structure

In this section, we discuss how data is stored in Cassandra. We describe the Cassandra primary key and how it differs from the primary key in relational databases. Finally, we discuss CQL and its commands.

3.1 Cassandra Data Model

Fig. 1. A simple row in Cassandra table

Cassandra is a column family database. A column family is a NoSQL object that contains key-value pairs, in which each key is mapped to a set of columns. This object borrows some characteristics of relational databases: the column family is a table in which each key-value pair is a row. Rows can have several different columns. Each column contains a name, a value, and a timestamp. Several columns can also be grouped together [8]; such a set of related columns is called a super column family. In effect, the column family is an extension of the key-value model. Fig. 1 shows a simple row in Cassandra.

Fig. 2. Cassandra table contains different rows

In Cassandra, related rows are grouped in a logical division called a table. For example, suppose we want to store user information in a table named user. Creating a table is similar to a relational database: we set the columns and their types when creating the table. Inserting data into a Cassandra table is a bit different, because we do not need to supply all the columns each time we add a row or entity. For example, some people have two phone numbers and some have none, and in a web form some fields are required and some are optional. Unlike a relational database, where we must store null for values that we do not know, Cassandra simply does not store the column, so our table looks like Fig. 2.

The primary key in Cassandra is similar to the primary key in a relational database. In addition, Cassandra has a special primary key called a composite key, which includes a partition key and a set of clustering columns. The partition key, which can contain multiple columns, specifies the node on which a row is stored. The clustering columns specify how data is ordered within a partition. Cassandra also has another structure called the static column, which is shared by all the rows of a partition. Fig. 3 shows a wide row in Cassandra and how the partition key and clustering columns affect how data is stored [9].

Fig. 3. Cassandra wide row

3.2 CQL

This structure is suitable for time series data, such as the data that sensors produce, tweets, the comments that users write under posts, and so on. To explain this structure in more depth and become familiar with CQL (Cassandra Query Language), let's take the example of building a table that stores weather station data [10]. The command to create the table in CQL is as follows. Fig. 4 illustrates the table created by this command.

CREATE TABLE temperature_by_day (
    weatherstation_id text,
    date text,
    event_time timestamp,
    temperature text,
    PRIMARY KEY ((weatherstation_id, date), event_time)
);

Fig. 4. The simplest model for storing time series data for each source

As seen in the definition of the primary key, two of the columns are enclosed in an inner pair of parentheses, which divides the primary key into two parts. The first part, inside the inner parentheses, is the partition key, and the remaining columns are clustering keys. This structure removes redundancy and provides quick access to the data. An example of inserting data into the table is as follows.

INSERT INTO temperature_by_day (weatherstation_id, date, event_time, temperature)
VALUES ('1234ABCD', '2013-04-03', '2013-04-03 07:02:00', '73F');

As you can see, CQL is very similar to SQL. Commands such as insert and update resemble their SQL counterparts. However, Cassandra does not support the join command, and comparison operators (such as < and >=) can be used only on the last clustering column.
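The effect of the composite key on storage can be illustrated with a small in-memory model. The following Python sketch is not Cassandra code; the names partitions, insert, and select_range are hypothetical, and the rows other than the '73F' reading are invented for illustration. It groups rows by the partition key (weatherstation_id, date) and keeps each partition sorted by the clustering column event_time, so a range query touches a single partition.

```python
from bisect import bisect_left, bisect_right, insort
from collections import defaultdict

# Hypothetical in-memory model of temperature_by_day:
# partition key (weatherstation_id, date) -> rows kept sorted by event_time.
partitions = defaultdict(list)

def insert(weatherstation_id, date, event_time, temperature):
    """Store the row in its partition, ordered by the clustering column."""
    insort(partitions[(weatherstation_id, date)], (event_time, temperature))

def select_range(weatherstation_id, date, start, end):
    """Rows of one partition with start <= event_time <= end."""
    rows = partitions.get((weatherstation_id, date), [])
    times = [t for t, _ in rows]
    return rows[bisect_left(times, start):bisect_right(times, end)]

insert("1234ABCD", "2013-04-03", "2013-04-03 07:02:00", "73F")
insert("1234ABCD", "2013-04-03", "2013-04-03 07:01:00", "72F")  # illustrative row
insert("1234ABCD", "2013-04-04", "2013-04-04 07:01:00", "74F")  # illustrative row

morning = select_range("1234ABCD", "2013-04-03",
                       "2013-04-03 07:00:00", "2013-04-03 07:01:30")
print(morning)
```

Timestamps are kept as ISO-formatted strings here only because they sort correctly as text; a real table would use the timestamp type shown in the CREATE TABLE command.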

4 Cassandra Distribution

Cassandra is a free, open-source, distributed, wide-column-store NoSQL database management system designed to handle large amounts of data across many commodity servers, providing high availability with no single point of failure. This means that a logical database is divided and stored across one or more machines, each of which is called a node. The database uses a peer-to-peer architecture: nodes connect to each other and form a cluster. Cassandra provides two groupings for the topology of clusters: rack and data center. A rack is a logical set of nodes near each other, and a data center is a logical set of racks.

As described in the data model section, Cassandra stores and accesses data with a primary key or composite key. Because data is split between several nodes, Cassandra uses a distributed hash table (DHT) for efficient and fast access to data. With a DHT, there is no need to ask each node whether it contains a key, nor do all nodes need to be available to prove that a key does not exist. The DHT maps each key to the node that stores it.

Fig. 5. Data distribution and token assignment in Cassandra

But if we add or remove a node, a hash function that depends on the number of nodes can no longer be used, because most keys would then map to a different node. To solve this problem, the consistent hashing algorithm is used. The goal of this algorithm is that each node can efficiently locate any key despite the constant removal and addition of nodes within the cluster. In this scheme, the nodes are arranged in a ring and each node is responsible for a range of keys. Fig. 5 shows five nodes that form a cluster and store each piece of data on the node it belongs to [11].
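The ring described above can be sketched in a few lines of Python. This is a minimal illustration of consistent hashing in general, not Cassandra's implementation: the Ring class, the node names, and the use of MD5 to derive tokens are all assumptions made for the example. Each key belongs to the first node found clockwise from the key's token, and removing a node relocates only the keys that node owned.

```python
import bisect
import hashlib

def token(name: str) -> int:
    """Position on the ring (MD5 is used here only for the sketch)."""
    return int(hashlib.md5(name.encode()).hexdigest(), 16)

class Ring:
    """Hypothetical consistent-hash ring: sorted (token, node) pairs."""

    def __init__(self, nodes):
        self._tokens = sorted((token(n), n) for n in nodes)

    def add(self, node):
        bisect.insort(self._tokens, (token(node), node))

    def remove(self, node):
        self._tokens.remove((token(node), node))

    def owner(self, key):
        """First node clockwise from the key's token, wrapping around."""
        i = bisect.bisect_left(self._tokens, (token(key), ""))
        return self._tokens[i % len(self._tokens)][1]

ring = Ring([f"node{i}" for i in range(1, 6)])
keys = [f"key{i}" for i in range(1000)]
before = {k: ring.owner(k) for k in keys}

ring.remove("node3")
after = {k: ring.owner(k) for k in keys}

# Only keys that lived on the removed node change owner.
moved = [k for k in keys if before[k] != after[k]]
print(len(moved), "of", len(keys), "keys moved")
```

With a hash of the form hash(key) mod N, by contrast, changing N would reassign most keys, which is the problem the ring avoids.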

As mentioned, since Cassandra uses a peer-to-peer architecture, all nodes are interconnected. The client connects to one of the nodes. The node receiving the request is called the coordinator, and any node can play this role. If the key does not fall in the coordinator's range, the request is forwarded to the node responsible for that key.

In addition to partitioning the database, Cassandra has another feature called replication. This means that multiple copies of each piece of data are created in the cluster. The number of copies is set by the replication factor. This feature makes it possible to guarantee availability when a node fails. In addition, with replication more than one machine is involved in migrating data when nodes are added or removed, which improves performance.
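As a rough illustration of how a replication factor translates into copies on the ring, the sketch below places each key on its owner and on the next nodes clockwise. This placement rule, the five-node RING with round token values, and the function names are assumptions for the example; Cassandra's actual replication strategies also take racks and data centers into account.

```python
import bisect

# Hypothetical five-node ring: (token, node) pairs sorted by token.
RING = [(0, "A"), (20, "B"), (40, "C"), (60, "D"), (80, "E")]

def replicas(key_token, replication_factor, ring=RING):
    """Owner of the token plus the next nodes clockwise, one copy per node."""
    tokens = [t for t, _ in ring]
    start = bisect.bisect_left(tokens, key_token) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]

def readable_after_failure(key_token, replication_factor, failed):
    """True if at least one replica of the key is on a node that is still up."""
    return any(node not in failed
               for node in replicas(key_token, replication_factor))

print(replicas(45, 3))                         # copies of a key with token 45
print(readable_after_failure(45, 3, {"D"}))    # still readable if D fails
print(readable_after_failure(45, 1, {"D"}))    # lost with a single copy
```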

5 Cassandra Read and Write

The Cassandra database architecture differs from that of relational databases. Its primary key is defined differently. It is a distributed database that uses consistent hashing to store and access data across multiple servers, and it stores data both in memory and on disk.

In this section, we look deeper into the details of the database architecture and Cassandra's read and write operations. We describe the whole scenario of reading and writing, from the request sent by the client to reading the table files stored on the nodes.

5.1 Write Operation

To write, the client connects to the coordinator. This node delegates the request to a service called StorageProxy, whose job is to identify all nodes responsible for storing the data. Once the replica nodes are identified, StorageProxy sends messages to all of them and then waits for responses from the responsible nodes. We now describe the write within a node.

When a write operation is performed, the data is immediately written to the commit log. The commit log is a crash-recovery mechanism that supports Cassandra's durability goal: a write does not succeed until it has been written to the commit log. If the database crashes or shuts down, the commit log ensures that the data is not lost, because the commit log is replayed when the node starts again. This is the only time the file is read, and users do not have access to it.

After the data is written to the commit log, it is written to a memory-resident data structure called the memTable. Each memTable contains data for a specific table, and there may be several memTables for each table. Unlike SSTables, an active memTable is mutable: it keeps accepting writes until it is flushed.

When the number of objects stored in the memTable reaches a threshold, its contents are flushed to disk in a file called an SSTable, and a new memTable is created. Each commit log has a bit flag indicating whether it needs flushing. When a write is received, it is first written to the commit log and the bit flag is set to 1. After the memTable is flushed to disk, the corresponding bit flag is set to 0.

There is no read or seek before a write, which is one of the reasons that writes in Cassandra are fast. Every write operation in Cassandra is an append.
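The write path inside a node can be summarized with a toy model. The Node class below is a hypothetical sketch, not Cassandra's code: the commit log is a Python list, the memTable is a dictionary, and an SSTable is a sorted, immutable tuple. Clearing the commit log after a flush stands in for resetting the bit flag described above.

```python
class Node:
    """Toy model of the write path: commit log, then memTable, then SSTable."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []   # append-only log used for crash recovery
        self.memtable = {}     # in-memory structure that accepts writes
        self.sstables = []     # immutable flushed files, oldest first
        self.flush_threshold = flush_threshold

    def write(self, key, value, timestamp):
        # 1. Append to the commit log; no read or seek is needed.
        self.commit_log.append((key, value, timestamp))
        # 2. Apply the write to the memTable.
        self.memtable[key] = (value, timestamp)
        # 3. Flush once the memTable holds enough objects.
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # Write a sorted, immutable snapshot and start a new memTable.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        # The flushed entries no longer need replaying after a crash.
        self.commit_log.clear()

node = Node(flush_threshold=2)
node.write("k1", "A1", 1)
node.write("k2", "A2", 2)   # reaches the threshold and triggers a flush
node.write("k1", "AA1", 3)  # lands in the new memTable
print(node.sstables, node.memtable, node.commit_log)
```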

SSTables are immutable; they are never modified in place, only compacted. In compaction, the SSTables of a table are merged together. SSTables are ordered in reverse chronological order (latest first).

Each SSTable has a corresponding Bloom filter stored in memory, which is used to speed up the read operation. A Bloom filter is a fast, probabilistic data structure that tests whether an element is a member of a set: it may report false positives but never false negatives.
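A Bloom filter is small enough to sketch directly. The BloomFilter class below is a generic illustration, with the sizes, the number of hash functions, and the use of SHA-256 chosen arbitrarily for the example rather than taken from Cassandra. Adding a key sets a few bit positions; a lookup answers "definitely absent" if any of those positions is unset, and "possibly present" otherwise.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter; parameters are illustrative, not Cassandra's."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for simplicity

    def _positions(self, key):
        # Derive num_hashes positions by salting the key with a seed.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{key}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False means the key was never added; True may be a false positive.
        return all(self.bits[pos] for pos in self._positions(key))

sstable_filter = BloomFilter()
for partition_key in ("k1", "k2", "k3"):
    sstable_filter.add(partition_key)

print(sstable_filter.might_contain("k2"))
```

Because a negative answer is always correct, a read can skip every SSTable whose filter rules the key out and pay for disk access only where the key may exist.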

5.2 Read Operation

As with the write operation, the client sends the read request to the coordinator, and the coordinator delegates it to StorageProxy. StorageProxy finds the list of replica nodes and uses the snitch function to determine the closest node containing the key. There are seven different types of snitch function for finding the closest node. After the closest node is found, the query is sent to it, and the other replica nodes send a digest. We now explain what happens inside a node to retrieve the data.

When a read request arrives, the memTables are read first, which is very fast. If the necessary data is not found, Cassandra searches the SSTables. Since disk access is typically much slower than memory access and each SSTable has a corresponding Bloom filter stored in memory, the Bloom filter is checked before the disk is accessed. Apart from the Bloom filter for partition keys, there is another Bloom filter for each row that specifies whether a column exists in the SSTable. Cassandra then reads the SSTables one by one, from newest to oldest [12].
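The read path can be sketched in the same spirit. In the hypothetical read function below, a Python set stands in for each SSTable's Bloom filter (so, unlike a real filter, it never gives false positives), and the function always merges every source by timestamp, which is a simplification of the early-exit behaviour described above. The sample data mirrors the situation in Fig. 6, where newer values of some columns of key k2 override older ones.

```python
def read(key, memtable, sstables):
    """Merge the columns of `key`; the newest timestamp wins per column.

    memtable: {key: {column: (value, timestamp)}}
    sstables: list of (key_filter, data), oldest first, where key_filter is a
              set standing in for the Bloom filter and data is shaped like
              the memtable.
    """
    merged = {}
    # memTable first, then SSTables from newest to oldest, skipping any
    # SSTable whose filter says the key is absent.
    sources = [memtable] + [data for key_filter, data in reversed(sstables)
                            if key in key_filter]
    for source in sources:
        for column, (value, timestamp) in source.get(key, {}).items():
            if column not in merged or timestamp > merged[column][1]:
                merged[column] = (value, timestamp)
    return {column: value for column, (value, _) in merged.items()}

older = ({"k1", "k2"}, {"k2": {"A": ("A2", 1), "B": ("B2", 1), "C": ("C2", 1)}})
newer = ({"k2"}, {"k2": {"A": ("AA2", 2)}})
memtable = {"k2": {"C": ("CC2", 3)}}

row = read("k2", memtable, [older, newer])
print(row)
```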

Fig. 6. Reading data of k2 key. A2 is updated with AA2 and C2 is updated with CC2[13].


6 Experimental Results

In this section, we report the results of the practical comparison of the Cassandra and MySQL databases. We also describe the experimental environment and the data used in this comparison.

6.1 Experimental Environment

Since each NoSQL database is very efficient for a particular set of data models, and Cassandra efficiently manages data generated over time, such as sensor readings and logs, the data used for the comparison is a simulation of weather station sensors. The data is generated by an application written in Java. To compare the two databases, we create a table with four columns and an index in MySQL using the following commands.

CREATE TABLE climate (
    city varchar (20),
    date date,
    time time,
    temperature int (2)
);

CREATE INDEX first ON climate (city, date);

The following command creates the corresponding table in Cassandra. This table has a composite key suited to the specific queries used in the evaluation.

CREATE TABLE climate (
    city text,
    date date,
    time time,
    temperature int,
    PRIMARY KEY ((city, date), time)
);

In this evaluation, we compare only insert and select queries. For writing, an application written in Java creates the desired number of insert queries and stores them in a file. The program increases the time field of each query by a fixed value. For example, the program can be configured to simulate sensors sending the temperature every 20 seconds. As a result, the composite key in the Cassandra database is unique, and the number of rows in the table is the same in both databases after several consecutive insert commands. Since the insert command is similar in CQL and SQL, the same file can be used for inserting data into both databases.

In this evaluation, we use the executeAsync method of the Cassandra Java driver for insert queries. This method allows the application to send non-blocking commands to Cassandra: once the request is sent, the program resumes execution and does not wait for the response from the database. In the meantime, using the Java Future feature, a thread waits until the database sends a response and then performs the desired action depending on whether the insert query failed or succeeded. To evaluate the select query, we created the four select queries listed in Table 1.
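The non-blocking insert pattern described above can be illustrated without a database. The sketch below is a Python analogue built on the standard concurrent.futures module, not the Cassandra Java driver; execute_insert is a hypothetical stand-in that only echoes the statement. Each submission returns a future immediately, and a callback handles the response when it arrives.

```python
from concurrent.futures import ThreadPoolExecutor
import threading

def execute_insert(statement):
    """Hypothetical stand-in for a database call; it only echoes the statement."""
    return f"ok: {statement}"

results = []
results_lock = threading.Lock()

def on_done(future):
    # Runs when the response is available; the submitting loop never blocks.
    with results_lock:
        results.append(future.result())

statements = [f"insert #{i}" for i in range(5)]

with ThreadPoolExecutor(max_workers=4) as pool:
    for statement in statements:
        # submit() returns a future at once, like an asynchronous execute call.
        pool.submit(execute_insert, statement).add_done_callback(on_done)
# Leaving the with-block waits for all pending inserts to complete.

print(sorted(results))
```

The responses may arrive in any order, which is why the example sorts them before printing.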

Table 1. select queries performed in evaluation.

Identifier  Query

Select 1    SELECT * FROM climate WHERE city = 'New York'
            AND date = '2016-01-04'

Select 2    SELECT * FROM climate WHERE city = 'Los Angeles'
            AND date IN ('2016-01-01', '2016-01-02', '2016-01-03')
            AND time >= '06:00:00' AND time <= '17:00:00'

Select 3    SELECT * FROM climate WHERE city IN ('New Orleans',
            'Austin', 'Chicago') AND date = '2016-01-03'

Select 4    SELECT * FROM climate WHERE city IN ('Atlanta',
            'Boston', 'New Orleans') AND date IN ('2016-01-01',
            '2016-01-02', '2016-01-03')

In a separate evaluation, we concluded that increasing the number of table rows without changing the interval at which the sensors send the temperature has no effect on the query execution time. Only an increase in the number of rows returned by the select query affects the query execution time, because both databases use an index to retrieve information.

The application that creates the insert commands was therefore configured to create five different tables, each simulating a sensor that sends the temperature every 1, 2, 5, 10, or 20 seconds. Since the queries retrieve temperatures for whole days or for a certain time window, the number of returned rows increases as the interval decreases. Table 2 gives the experimental environment and configuration.

Table 2. The environment information in detail.

Component                                Information

Hardware and OS   CPU                    Intel Core i7-2600, 3.40GHz × 4
                  Memory                 12 GB
                  OS                     Ubuntu 16.04 LTS

Implementation    RDB                    MySQL
                  NoSQL                  Cassandra
                  Programming language   Java 8

6.2 Evaluation Results

In this evaluation, we measured the runtime from executing the first query to receiving the last query response. Fig. 7 shows the results for the insert query. The horizontal axis shows the number of consecutive insert queries, that is, the number of rows in the table. The vertical axis indicates the program execution time in milliseconds.

Fig. 7. The execution time of insert query

As seen, the insert operation in Cassandra is faster than in MySQL for every workload except the smallest (100 k rows), owing to the use of the Async method and to the storage structure of Cassandra, where data is first stored in memory. The gap widens as the amount of data grows.

We observe that MySQL uses only two CPU threads, at 70% utilization, whereas the Cassandra run uses all eight CPU threads at 92% utilization. Because the application uses the Async method, it uses 4 or 5 threads and the Cassandra database uses 4 threads. This means that with Cassandra we can take advantage of multi-threaded programming and parallelization.

The panels of Fig. 8 show the runtime of the select queries. The horizontal axis displays the number of rows returned in the query response, and the vertical axis displays the runtime in milliseconds.

Data for Fig. 7 (execution time in milliseconds by row count):

            100 k   500 k   1 m     2 m      5 m
Cassandra   8401    20845   35356   71223    189368
MySQL       7408    36945   74102   154040   380501

[Fig. 8 panels Select 1 and Select 2: charts of runtime in milliseconds for MySQL and Cassandra. Row counts on the horizontal axis: 4320, 8640, 17280, 43200, 86400 (Select 1) and 5943, 11883, 23763, 59403, 118803 (Select 2).]

Fig. 8. The execution time of select queries shown in Table. 1 on five different tables
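The row counts on the horizontal axes of Fig. 8 follow from the sampling interval and the shape of each query in Table 1. The short calculation below reproduces them under two assumptions that are not stated in the text: readings fall exactly on multiples of the interval starting at midnight, and both bounds of the time range in Select 2 are inclusive. The function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def readings(interval_s, window_s=SECONDS_PER_DAY, inclusive_end=False):
    """Readings in a window when one reading is sent every interval_s seconds."""
    return window_s // interval_s + (1 if inclusive_end else 0)

def expected_rows(interval_s):
    """Rows returned by Select 1 to Select 4 for a given sampling interval."""
    select1 = readings(interval_s)                       # 1 city, 1 day
    select2 = 3 * readings(interval_s, 11 * 3600, True)  # 3 days, 06:00 to 17:00
    select3 = 3 * readings(interval_s)                   # 3 cities, 1 day
    select4 = 3 * 3 * readings(interval_s)               # 3 cities, 3 days
    return select1, select2, select3, select4

for interval in (20, 10, 5, 2, 1):
    print(interval, expected_rows(interval))
```

For the 20-second table, for example, one city and one day give 86400 / 20 = 4320 rows, which is the first value on the Select 1 axis.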

To evaluate the select queries, we first take a deeper look at indexing in relational databases. Relational databases use two types of index [14]: B-Tree and Hash.

A B-Tree index is used for queries whose comparisons contain the operators =, <, <=, >, >=, BETWEEN, and LIKE. The time to access data with this structure is O(log n). A Hash index is used only for queries that contain the operator = and is very fast. It cannot be used for range comparison operators, and the optimizer cannot use a hash index to speed up ORDER BY operations [15]. In this research, we used a B-Tree index in MySQL because the queries contain comparison operators.
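The difference between the two index types can be shown with plain Python data structures. In this sketch, a dictionary plays the role of a hash index and a sorted list searched with bisect plays the role of the ordered keys of a B-Tree; neither is how MySQL implements its indexes, and the sample rows are invented for illustration.

```python
from bisect import bisect_left, bisect_right

# Illustrative (time, temperature) rows, already ordered by time.
rows = [("06:00:00", 12), ("09:30:00", 15), ("12:00:00", 19), ("17:00:00", 16)]

hash_index = {time: temp for time, temp in rows}  # supports equality only
sorted_keys = [time for time, _ in rows]          # stand-in for B-Tree key order

def equals(time):
    """Equality lookup: a single hash probe, O(1) on average."""
    return hash_index.get(time)

def between(low, high):
    """Range lookup: two O(log n) binary searches over the ordered keys."""
    return rows[bisect_left(sorted_keys, low):bisect_right(sorted_keys, high)]

print(equals("12:00:00"))
print(between("08:00:00", "13:00:00"))
```

The dictionary cannot answer the range query without scanning every entry, because hashing destroys the key order that the binary search relies on.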

Looking at the results of the select operations, Cassandra has a lower response time than MySQL for Select 1 and Select 2. Cassandra accesses data divided into partitions by key in O(1), because it uses Bloom filters, indexes the rows of each SSTable, and sorts rows by clustering columns within the SSTable, whereas MySQL accesses data in O(log n) using the B-Tree index. This makes indexing stronger in Cassandra. In Select 3 and Select 4, however, Cassandra is not better, because Cassandra has some limitations on querying, one of which concerns the IN keyword. The IN condition is recommended only on the last column of the partition key. Using IN can degrade performance because usually many nodes must be queried. For example, in a single, local data center cluster with 30 nodes and a replication factor of 3, a single-key query goes out to two nodes, but if the query uses the IN condition, the number of nodes being queried is most likely higher, up to 20 nodes depending on where the keys fall in the token range.

[Fig. 8 panels Select 3 and Select 4: charts of runtime in milliseconds for MySQL and Cassandra. Row counts on the horizontal axis: 12960, 25920, 51840, 129600, 259200 (Select 3) and 38880, 77760, 155520, 388800, 777600 (Select 4).]

7 Conclusions and future work

We presented a general overview of the Cassandra structure, including the distribution of data and queries. Cassandra has a column family architecture that is appropriate for time series data. This is clearly seen in the comparison between this database and MySQL using the Java language and weather station data. We did not use Cassandra's distribution feature in this experiment, which could boost its performance further; we compared the performance of the two databases on one machine. As future work, this feature of Cassandra can be considered. We will also evaluate other data models using the properties of this database and make comparisons with other NoSQL databases. Finally, we want to compare how Cassandra and a relational database behave when multiple clients connect.

References

1. E. F. Codd: A Relational Model of Data for Large Shared Data Banks. Communications of the ACM, 377-387 (1970).

2. Han Jing, E. Haihong, Guan Le, Jian Du: Survey on NoSQL database. In: 2011 6th International Conference on Pervasive Computing and Applications (ICPCA), pp. 363-366. IEEE, Port Elizabeth (2011).

3. Makris A, Tserpes K, Andronikou V, Anagnostopoulos D: A classification of NoSQL data stores based on key design characteristics. Procedia Computer Science, 94-103 (2016).

4. Atzeni P, Bugiotti F, Cabibbo L, Torlone R: Data modeling in the NoSQL world. Computer Standards & Interfaces (2016).

5. Kabakus, Abdullah Talha, and Resul Kara: A performance evaluation of in-memory databases. Journal of King Saud University-Computer and Information Sciences, 520-525 (2017).

6. Van der Veen, J.S., Van der Waaij, B. and Meijer, R.J: Sensor data storage performance: SQL or NoSQL, physical or virtual. In: 2012 IEEE 5th International Conference on Cloud Computing (CLOUD), pp. 431-438. IEEE, Honolulu (2012).

7. Lee, K. K. Y., Tang, W. C., & Choi, K. S: Alternatives to relational database: comparison of NoSQL and XML approaches for clinical data storage. Computer Methods and Programs in Biomedicine, 99-109 (2013).

8. Column family, Wikipedia, https://en.wikipedia.org/wiki/Column_family, last accessed 2018/07/27.

9. Jeff Carpenter, Eben Hewitt: Cassandra: The Definitive Guide. 2nd edn. O'Reilly, California (2016).

10. Getting Started with Time Series Data Modeling, https://academy.datastax.com/resources/getting-started-time-series-data-modeling, last accessed 2018/07/27.

11. Robbie Strickland: Cassandra 3.x High Availability. 2nd edn. Packt Publishing, Birmingham (2016).

12. Nishant Neeraj: Mastering Apache Cassandra. 2nd edn. Packt Publishing, Birmingham (2015).

13. Nitin Padalia: Apache Cassandra Essentials. Packt Publishing, Birmingham (2015).

14. Abraham Silberschatz, Henry F. Korth, S. Sudarshan: Database System Concepts. 6th edn. McGraw-Hill, New York (2010).

15. Comparison of B-Tree and Hash Indexes, https://dev.mysql.com/doc/refman/5.5/en/index-btree-hash.html, last accessed 2018/07/27.
