Data processing in Cassandra vs MySQL: A comparative
analysis in the query performance
Seyyed Ali Hosseini1, Fereshteh-Azadi Parand1, Farzam Matinfar1
1 Allameh Tabataba'i University, Tehran, Iran
{ ali_hosseini, parand, f.matinfar}@atu.ac.ir
Abstract. Today, data are generated in massive volumes, at very high speed and in great
variety. Besides the growth of the Internet and social networks, electronic devices,
sensors and even turbines generate several gigabytes of data each day. In most
applications, relational database management systems (RDBMS) are responsible for
storing and managing these data, but because of architectural limitations they cannot
handle this volume efficiently. Other databases are therefore needed to store massive
data and to provide quick access to it. In the last few years, another class of database
management systems, called NoSQL, has grown in popularity for managing huge amounts
of data. Databases of this class can handle large amounts of data and speed up access
to it. In this paper, we describe NoSQL databases, present the structure of Cassandra,
a well-known NoSQL database, and finally compare its performance with the MySQL
relational database and present experimental results.
Keywords: NoSQL, RDBMS, Comparison, Cassandra, Big Data, Column
Family, Distributed Database.
1 Introduction
In 1970 Dr. Edgar F. Codd, then a staff member at IBM, introduced the theory of
relational data modeling in an article entitled A Relational Model of Data for Large
Shared Data Banks [1]. This article became the basis of the relational database
management system (RDBMS). Relational databases are still among the most popular
applications in the history of computing. Today, however, the exponential growth of
data has given rise to a new domain named Big Data. Big Data refers to data that is
produced at high speed, comes in different varieties (structured, semi-structured and
unstructured) and reaches volumes of several terabytes or petabytes. Limitations in
architecture, data model and scalability make relational databases unable to support
this rapid growth and variety of data. These limitations have led to the creation of
other platforms for Big Data.
NoSQL databases are well known in Big Data and Internet of Things applications.
NoSQL stands for Not Only SQL, a term applied to a class of database management
systems. Unlike relational databases, databases of this type do not require a tabular
structure for storing data. They can manage thousands of terabytes of data and provide
faster access to massive data. These databases are distributed. Two key features of
this class of databases are replication and partitioning. Replication lets developers
keep copies of their data on multiple servers and continue to access the data if a
server fails. Partitioning distributes data across multiple servers in the cluster.
NoSQL databases are divided into four classes based on how they store data:
Column Based. Such as HBase and Cassandra.
Key-Value. Such as Redis.
Document-Oriented. Such as MongoDB.
Graph Oriented. Such as Neo4j.
In this paper, we describe the structure and examine the performance of Cassandra,
a column-based NoSQL database, and compare it with the relational database MySQL.
MySQL is an open-source database developed and supported by Oracle.
The rest of the paper is organized as follows. Section 2 reviews related work. Section 3
introduces the Cassandra data model and structure and its query language, CQL.
Section 4 discusses distribution in Cassandra, and Section 5 describes its read and
write operations. Section 6 describes the experimental environment and the data used
to compare the two databases and presents the results of running the queries on both.
Finally, Section 7 concludes the paper and outlines future work.
2 Related Work
NoSQL databases are next-generation databases that mostly address some of the
following points: being non-relational, distributed, open-source and horizontally
scalable. The original intention was modern web-scale databases. The movement began
in early 2009 and is growing rapidly. Today, there are 18 free and widely used
open-source NoSQL databases. Han et al. classify NoSQL databases according to the CAP
theorem and describe the background, basic characteristics and data model of NoSQL [2].
Makris et al. [3] and Atzeni et al. [4] analyze some NoSQL databases and describe a
classification of NoSQL data stores based on key design characteristics.
Some studies compare NoSQL databases with other database management systems,
especially relational databases. Kabakus and Kara [5] evaluate the performance of
in-memory databases. The NoSQL databases used in their experiment are MongoDB,
Cassandra, Redis, and Memcached; the relational database is H2, which, unlike most
relational databases, stores data in memory. They compare the databases in four
experiments: (1) time to write key-value pairs, (2) time to read the value
corresponding to a given key, (3) time to remove the key-value pair corresponding to a
given key, and (4) time to get all the data. They conclude that Memcached clearly
provides the best write performance, that Redis uses memory more efficiently than the
others and also performs better for the read and delete operations, and that MongoDB
provides by far the best performance for fetching the whole data set. van der Veen
et al. [6] compare the performance of PostgreSQL, MongoDB, and Cassandra by simulating
data sent from several sensors. Lee et al. [7] compare the performance of NoSQL and
XML databases for storing and retrieving clinical data.
In this paper, we examine the structure and data model of the Cassandra database in
depth and explain its advantages and disadvantages. In an experiment, we compare its
performance with that of the MySQL database. We report and analyze the results of this
experiment and describe the conditions under which Cassandra performs better than
MySQL on one machine.
3 Cassandra structure
In this section, we discuss how data is stored in Cassandra. We describe the Cassandra
primary key and how it differs from that of relational databases. Finally, we discuss
CQL and its commands.
3.1 Cassandra Data Model
Fig. 1. A simple row in Cassandra table
Cassandra is a column family database. A column family is a NoSQL object that contains
key-value pairs, in which each key is mapped to a set of columns. This object borrows
some characteristics of relational databases: the column family is like a table in
which each key-value pair is a row. Rows can have different columns. Each column
consists of a name, a value, and a timestamp. Several columns can also be grouped [8];
such a set of related columns is called a super column. In effect, the column family
extends the key-value model. Fig. 1 shows a simple row in Cassandra.
Fig. 2. Cassandra table contains different rows
In Cassandra, related rows are grouped in a logical division called a table. For
example, suppose we want to store user information in a table named user. Creating a
table is similar to a relational database: we define the columns and their types when
creating the table. Inserting data into a Cassandra table is slightly different,
because we do not need to supply all the columns each time we add a row or entity. For
example, some people have two phone numbers and others have none, and in a web form
some fields are required and some are optional. In a relational database we must store
NULL for values that we do not know, whereas Cassandra simply does not store the
column, so the table looks like Fig. 2.
The primary key in Cassandra is similar to that of a relational database. In addition,
Cassandra has a special primary key called a composite key, which consists of a
partition key and a set of clustering columns. The partition key, which can contain
multiple columns, determines the node on which a row is stored. The clustering columns
determine how data is ordered within a partition. Cassandra also has another structure
called the static column, whose value is shared by all the rows of a partition. Fig. 3
shows a Cassandra wide row and how the partition key and clustering columns affect the
way data is stored [9].
Fig. 3. Cassandra wide row
3.2 CQL
This structure is suitable for time series data, such as the data that sensors
produce, tweets, the comments that users write under posts, and so on. To explain this
structure in more depth and to get familiar with CQL (Cassandra Query Language), let us
take the example of building a table that stores weather station data [10]. The command
to create the table in CQL is as follows. Fig. 4 illustrates the table created by this
command.
CREATE TABLE temperature_by_day (
weatherstation_id text,
date text,
event_time timestamp,
temperature text,
PRIMARY KEY ((weatherstation_id, date), event_time)
);
Fig. 4. The simplest model for storing time series data for each source
As seen in the definition of the primary key, two of the columns are enclosed in an
inner pair of parentheses, which divides the primary key into two parts. The first
part, inside the inner parentheses, is the partition key, and the remaining columns are
clustering columns. This structure removes redundancy and provides quick access to the
data. An example of inserting data into the table is as follows.
INSERT INTO temperature_by_day (weatherstation_id, date, event_time, temperature)
VALUES ('1234ABCD', '2013-04-03', '2013-04-03 07:02:00', '73F');
As you can see, CQL is very similar to SQL. Other commands, such as SELECT, UPDATE and
DELETE, are also similar to their SQL counterparts. However, Cassandra does not support
joins, and comparison operators such as < and >= can be used only on the last
clustering column restricted in a query.
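To make the roles of the partition key and the clustering column concrete, the following minimal Python sketch models a wide row in memory. It is an illustration only, not Cassandra's storage code: the dictionary `partitions` and the functions `insert` and `select_range` are hypothetical names, the sample readings are made up for the example, and timestamps are kept as ISO-formatted strings so that string order matches time order.

```python
import bisect
from collections import defaultdict

# Hypothetical in-memory model of the temperature_by_day table:
# partition key -> rows kept sorted by the clustering column (event_time).
partitions = defaultdict(list)


def insert(weatherstation_id, date, event_time, temperature):
    """Store a row in the partition selected by (weatherstation_id, date)."""
    partition_key = (weatherstation_id, date)
    # insort keeps the partition ordered by event_time, like a clustering column.
    bisect.insort(partitions[partition_key], (event_time, temperature))


def select_range(weatherstation_id, date, start, end):
    """Return the rows of one partition with start <= event_time <= end."""
    rows = partitions.get((weatherstation_id, date), [])
    # Binary search for the first row at or after `start`, then scan forward.
    first = bisect.bisect_left(rows, (start,))
    result = []
    for event_time, temperature in rows[first:]:
        if event_time > end:
            break
        result.append((event_time, temperature))
    return result


# Sample readings (illustrative values only).
insert("1234ABCD", "2013-04-03", "2013-04-03 07:02:00", "73F")
insert("1234ABCD", "2013-04-03", "2013-04-03 07:01:00", "72F")
insert("1234ABCD", "2013-04-04", "2013-04-04 07:01:00", "73F")
morning = select_range("1234ABCD", "2013-04-03",
                       "2013-04-03 07:00:00", "2013-04-03 07:01:30")
```

In this model a range condition on event_time is cheap only because the rows of a partition are already sorted by that column, which is one way to read the restriction on comparison operators mentioned above.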
4 Cassandra Distribution
Cassandra is a free, open-source, distributed, wide-column NoSQL database management
system designed to handle large amounts of data across many commodity servers,
providing high availability with no single point of failure. A logical database is
divided among and stored on one or more machines, each of which is called a node.
Cassandra uses a peer-to-peer architecture: nodes are connected to each other and form
a cluster. Cassandra provides two groupings for the topology of a cluster: rack and
data center. A rack is a logical set of nodes that are close to each other, and a data
center is a logical set of racks.
As described in the data model section, Cassandra stores and accesses data by a primary
key or composite key. Because data is split among several nodes, Cassandra uses a
distributed hash table (DHT) for efficient and fast access to data. With a DHT, there
is no need to ask each node whether it contains a key, nor do all nodes need to be
available to establish that a key does not exist: the hash function maps each key
directly to the node that stores it.
Fig. 5. Data distribution and token assignment in Cassandra
However, if a node is added or removed, the old hash function can no longer be used.
To solve this problem, the consistent hashing algorithm is used. The goal of this
algorithm is that each node can efficiently locate every key despite the constant
removal and addition of nodes within the cluster. The nodes are arranged in a ring, and
each node is responsible for a range of keys. Fig. 5 shows five nodes that form a
cluster and store each piece of data on the node it belongs to [11].
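The idea of consistent hashing can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Cassandra's partitioner: each node gets a single token, SHA-256 is an arbitrary stand-in for the hash function, and the names `token`, `Ring` and `lookup` are hypothetical.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (SHA-256 is an arbitrary choice)."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy consistent-hashing ring with one token per node."""

    def __init__(self, nodes=()):
        self._tokens = []  # sorted node tokens
        self._owner = {}   # token -> node name
        for node in nodes:
            self.add(node)

    def add(self, node: str) -> None:
        t = token(node)
        bisect.insort(self._tokens, t)
        self._owner[t] = node

    def remove(self, node: str) -> None:
        t = token(node)
        self._tokens.remove(t)
        del self._owner[t]

    def lookup(self, key: str) -> str:
        """Return the first node clockwise from the key's token."""
        i = bisect.bisect_left(self._tokens, token(key))
        return self._owner[self._tokens[i % len(self._tokens)]]


ring = Ring(["node1", "node2", "node3", "node4", "node5"])
keys = [f"key{i}" for i in range(1000)]
before = {k: ring.lookup(k) for k in keys}
ring.remove("node3")
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

After node3 is removed, only the keys that node3 used to own change their owner; every other key stays where it was, which is the property that a plain "hash modulo number of nodes" scheme lacks.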
As mentioned, since Cassandra uses a peer-to-peer architecture, all nodes are
interconnected. The client connects to one of the nodes; the node that receives the
request is called the coordinator, and any node can play this role. If the key does not
fall in the coordinator's range, the request is forwarded to the node that is
responsible for the key.
In addition to partitioning the database, Cassandra has another feature called
replication: multiple copies of each piece of data are kept in the cluster. The number
of copies is set by the replication factor. Replication makes it possible to keep data
available when a node fails. In addition, because of replication more than one machine
takes part in migrating data when nodes are added or removed, which improves
performance.
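As a rough illustration of the replication factor, the sketch below places the copies of a key on the owning node and on the nodes that follow it on the ring. This is a simplification in the spirit of placing replicas on successive nodes; it ignores racks and data centers, and the function names `token` and `replicas` as well as the use of SHA-256 are assumptions made for the example.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Position of a node name or key on the ring (SHA-256 as a stand-in)."""
    return int(hashlib.sha256(value.encode("utf-8")).hexdigest(), 16)


def replicas(key: str, nodes: list, replication_factor: int) -> list:
    """Return the nodes holding copies of `key`: the owner and its ring successors."""
    ring = sorted(nodes, key=token)
    tokens = [token(node) for node in ring]
    start = bisect.bisect_left(tokens, token(key)) % len(ring)
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)] for i in range(count)]


nodes = ["node1", "node2", "node3", "node4", "node5"]
copies = replicas("New York|2016-01-04", nodes, replication_factor=3)

# Simulate the failure of the first replica: the remaining copies are still
# part of the new replica set, so the data stays available.
survivors = [n for n in nodes if n != copies[0]]
copies_after_failure = replicas("New York|2016-01-04", survivors, replication_factor=3)
```

With a replication factor of 3, losing one of the three nodes still leaves two copies in place, and only one further node has to receive the data to restore the replication factor.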
5 Cassandra Read and Write
The Cassandra database architecture differs from that of relational databases. Its
primary key is defined differently. It is a distributed database that uses an advanced
hash function to store and access data across multiple servers, and it stores data both
in memory and on disk.
In this section, we look more closely at the database architecture and at Cassandra's
read and write operations. We describe the whole path of a read and of a write, from
the request sent by the client to the reading of the table files stored on the nodes.
5.1 Write Operation
To write, the client connects to the coordinator. This node delegates the request to a
service called StorageProxy, whose job is to identify all the nodes responsible for
storing the data. Once the replica nodes are identified, StorageProxy sends messages to
all of them and then waits for the responses from the responsible nodes. Next, we
describe how the write proceeds within a node.
When a write operation is performed, the data is immediately written to the commit log.
The commit log is a crash-recovery mechanism that supports Cassandra's durability goal:
a write does not succeed until it has been written to the commit log. If the database
crashes or shuts down, the commit log ensures that the data is not lost, because the
commit log is replayed when the node starts again. This is the only time the file is
read, and users do not have access to it.
After the data is written to the commit log, it is written to a memory-resident data
structure called the memTable. Each memTable contains data for one specific table, and
there may be several memTables for each table.
When the number of objects stored in the memTable reaches a threshold, its contents are
flushed to disk in a file called an SSTable, and a new memTable is created. Each commit
log has a bit flag indicating whether it needs flushing. When a write is received, it
is first written to the commit log and the bit flag is set to 1. After the memTable is
flushed to disk, the corresponding bit flag is set to 0.
There is no read or seek before a write: every write operation in Cassandra is an
append. This is one of the reasons that writing is fast in Cassandra.
SSTables are immutable; they are never modified, only compacted. During compaction, the
SSTables of a table are merged together. SSTables are sorted in reverse chronological
order (latest first).
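The write path described in this section (commit log, memTable, flush to an SSTable, compaction) can be summarized in a toy Python model. It is a sketch under simplifying assumptions, not Cassandra's implementation: everything lives in memory, the flush threshold counts keys, clearing the log stands in for the bit flag, and the class and method names are hypothetical.

```python
class Node:
    """Toy single-node write path: commit log, memTable, immutable SSTables."""

    def __init__(self, flush_threshold=3):
        self.commit_log = []  # append-only list standing in for the log file
        self.memtable = {}    # in-memory structure receiving the writes
        self.sstables = []    # newest first; each SSTable is an immutable sorted tuple
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        # 1. Append to the commit log first (durability); no read or seek is needed.
        self.commit_log.append((key, value))
        # 2. Apply the write to the memTable.
        self.memtable[key] = value
        # 3. Flush when the memTable reaches its threshold.
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Write the memTable out as a new SSTable and start a fresh memTable."""
        self.sstables.insert(0, tuple(sorted(self.memtable.items())))  # latest first
        self.memtable = {}
        self.commit_log.clear()  # flushed entries no longer need to be replayed

    def compact(self):
        """Merge all SSTables into one, keeping the newest value of each key."""
        merged = {}
        for sstable in reversed(self.sstables):  # oldest to newest
            merged.update(sstable)
        self.sstables = [tuple(sorted(merged.items()))]


node = Node(flush_threshold=2)
node.write("k1", "A1")
node.write("k2", "A2")   # threshold reached: first SSTable is written
node.write("k2", "AA2")  # update goes to the log and the new memTable only
pending_log = list(node.commit_log)
node.write("k3", "A3")   # second SSTable is written
tables_before_compaction = len(node.sstables)
node.compact()
```

Note that the update of k2 never touches the first SSTable; the old and new values coexist on disk until compaction keeps only the newer one.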
Each SSTable has a corresponding Bloom filter stored in memory, which is used to speed
up read operations. A Bloom filter is a fast, probabilistic data structure that tests
whether an element is a member of a set: it may report false positives but never false
negatives.
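A Bloom filter is small enough to sketch directly. The version below is a minimal illustration, assuming a fixed-size bit array held in a Python integer and SHA-256 with different seeds as the hash functions; the sizes and the names `BloomFilter`, `add` and `might_contain` are choices made for this example and say nothing about the parameters Cassandra uses.

```python
import hashlib


class BloomFilter:
    """Minimal Bloom filter: answers "definitely absent" or "possibly present"."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, item: str):
        # Derive several bit positions by hashing the item with different seeds.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{item}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, item: str) -> None:
        for position in self._positions(item):
            self.bits |= 1 << position

    def might_contain(self, item: str) -> bool:
        # If any bit is unset, the item was never added (no false negatives).
        return all((self.bits >> position) & 1 for position in self._positions(item))


sstable_filter = BloomFilter()
for partition_key in ["New York|2016-01-04", "Boston|2016-01-01"]:
    sstable_filter.add(partition_key)
```

A `False` answer from `might_contain` is always correct, so the corresponding SSTable can be skipped without touching the disk; a `True` answer only means that the SSTable has to be checked.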
5.2 Read Operation
As in the write operation, the client sends the read request to the coordinator, and
the coordinator delegates it to StorageProxy. StorageProxy obtains the list of replica
nodes and uses the snitch function to determine the closest node that contains the key.
There are seven different types of snitch for finding the closest node. Once the
closest node is found, the query is sent to it, and the other replica nodes send back a
digest. Let us now explain what happens inside a node to retrieve the data.
When a read request arrives, the memTables are read first, which is very fast. If the
necessary data is not found, Cassandra searches the SSTables. Since disk access is
typically much slower than memory access, and each SSTable has a corresponding Bloom
filter stored in memory, the Bloom filter is checked before the disk is accessed. Apart
from the Bloom filter for partition keys, there is another Bloom filter for each row
that indicates whether a column exists in the SSTable. Cassandra then examines the
SSTables one by one, from the youngest to the oldest [12].
Fig. 6. Reading data of k2 key. A2 is updated with AA2 and C2 is updated with CC2[13].
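The read path inside a node can be sketched as follows. The example is a simplified model, not Cassandra's code: a Python set stands in for each SSTable's Bloom filter, the function name `read` is hypothetical, and the sample data is invented so that it matches the caption of Fig. 6 (A2 updated with AA2, C2 updated with CC2) without claiming to reproduce the exact layout of the figure.

```python
def read(key, memtable, sstables):
    """Collect the newest value of every column of `key`.

    memtable: dict mapping key -> {column: value}
    sstables: list ordered from youngest to oldest; each entry is a pair
              (key_filter, data) where key_filter is a set standing in for
              the Bloom filter and data maps key -> {column: value}.
    Returns the merged row and the number of SSTables actually read.
    """
    row = dict(memtable.get(key, {}))     # memory is consulted first
    sstables_read = 0
    for key_filter, data in sstables:     # youngest to oldest
        if key not in key_filter:         # filter says "definitely absent"
            continue                      # skip the disk access entirely
        sstables_read += 1
        for column, value in data.get(key, {}).items():
            row.setdefault(column, value)  # a newer value already seen wins
    return row, sstables_read


memtable = {"k2": {"C": "CC2"}}
sstables = [
    ({"k1", "k2"}, {"k1": {"A": "A1"}, "k2": {"A": "AA2"}}),
    ({"k3"}, {"k3": {"A": "A3"}}),
    ({"k2"}, {"k2": {"A": "A2", "B": "B2", "C": "C2"}}),
]
row, sstables_read = read("k2", memtable, sstables)
```

For key k2 the middle SSTable is skipped because its filter does not contain the key, and the older values A2 and C2 are shadowed by the newer AA2 and CC2.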
6 Experimental Results
In this section, we report the results of a practical comparison of the Cassandra and
MySQL databases. We also describe the experimental environment and the data used in
this comparison.
6.1 Experimental Environment
Each NoSQL database is very efficient for a particular set of data models, and
Cassandra efficiently manages data that is generated over time, such as sensor readings
and logs. The data used for the comparison is therefore a simulation of weather station
sensors, produced by an application written in Java. To compare the two databases, we
create a table with four columns and an index in MySQL using the following commands.
CREATE TABLE climate (
city varchar (20),
date date,
time time,
temperature int (2)
);
create index first on climate (city, date);
The following command creates the corresponding table in Cassandra. This table has a
composite key suited to the specific queries.
CREATE TABLE climate (
city text,
date date,
time time,
temperature int,
PRIMARY KEY ((city, date), time)
);
In this evaluation, we compare only insert and select queries. For writing, an
application written in Java creates the desired number of insert queries and stores
them in a file. The program increases the time field of each query by a fixed amount;
for example, it can be configured to simulate sensors sending the temperature every 20
seconds. As a result, the composite key in the Cassandra database is unique, and the
number of rows in the table is the same in both databases after a sequence of insert
commands. Since the insert command is similar in CQL and SQL, the same file can be used
to insert data into both databases.
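The generator described above can be sketched as follows. The original program was written in Java and is not shown in the paper, so this Python version is only an assumed reconstruction of the idea: the function name `generate_inserts`, the start time and the way the temperature values are produced are all hypothetical.

```python
from datetime import datetime, timedelta


def generate_inserts(city, start, interval_seconds, count):
    """Yield `count` INSERT statements, advancing the time by a fixed interval."""
    moment = start
    for i in range(count):
        # Hypothetical value generator; the paper does not say how the
        # simulated temperatures were chosen.
        temperature = 20 + (i % 10)
        yield (
            "INSERT INTO climate (city, date, time, temperature) "
            f"VALUES ('{city}', '{moment:%Y-%m-%d}', '{moment:%H:%M:%S}', {temperature});"
        )
        moment += timedelta(seconds=interval_seconds)


# One reading every 20 seconds for one day gives 86400 / 20 = 4320 statements.
statements = list(generate_inserts("New York", datetime(2016, 1, 4), 20, 4320))
```

Because the time advances on every statement, each (city, date, time) combination occurs once, so every statement creates a new row in Cassandra instead of overwriting an existing one.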
In this evaluation, we use the executeAsync method of the Cassandra Java driver for
insert queries. This method allows the application to send non-blocking commands to
Cassandra: once a request is sent, the program continues running and does not wait for
the response from the database. In the meantime, using Java futures, a thread waits for
the database's response and then performs the desired action depending on whether the
insert query succeeded or failed.
To evaluate the select query, we created the four select queries listed in Table 1.
Table 1. select queries performed in evaluation.
Identifier  Query
Select 1    SELECT * FROM climate WHERE city = 'New York' AND date
            = '2016-01-04'
Select 2    SELECT * FROM climate WHERE city = 'Los Angeles' AND
            date IN ('2016-01-01', '2016-01-02', '2016-01-03')
            AND time >= '06:00:00' AND time <= '17:00:00'
Select 3    SELECT * FROM climate WHERE city IN ('New Orleans',
            'Austin', 'Chicago') AND date = '2016-01-03'
Select 4    SELECT * FROM climate WHERE city IN ('Atlanta',
            'Boston', 'New Orleans') AND date IN ('2016-01-01',
            '2016-01-02', '2016-01-03')
In another evaluation, we found that increasing the number of rows in the table without
changing the interval at which the sensors send the temperature has no effect on the
query execution time. Only an increase in the number of rows returned by the select
query affects the execution time, because both databases use an index to retrieve the
data.
The application that creates the insert commands was therefore configured to create
five different tables, which simulate sensors sending the temperature every 1, 2, 5, 10
and 20 seconds, respectively. Since the queries retrieve temperatures for a whole day
or for a certain time of day, the number of returned rows increases as the interval
decreases. Table 2 gives the experimental environment and configuration.
Table 2. The environment information in detail.
Component        Item                  Information
Hardware and OS  CPU                   Intel Core i7-2600, 3.40GHz × 4
                 Memory                12 GB
                 OS                    Ubuntu 16.04 LTS
Implementation   RDB                   MySQL
                 NoSQL                 Cassandra
                 Programming language  Java 8
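The number of rows returned by each query in Table 1 follows from the sensor interval, and the arithmetic can be written out in a short Python sketch. It assumes that readings start at 00:00:00, that every queried city has a full day of readings for each queried date, and that both bounds of the time range in Select 2 are inclusive; the function names are hypothetical.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def rows_in_window(window_seconds, interval, inclusive_end=False):
    """Number of readings in a time window for a given sensor interval."""
    return window_seconds // interval + (1 if inclusive_end else 0)


def returned_rows(interval):
    """Expected result size of each query in Table 1 for one sensor interval."""
    full_day = rows_in_window(SECONDS_PER_DAY, interval)
    # Select 2 spans 06:00:00 to 17:00:00 (11 hours) with both bounds included.
    daytime = rows_in_window(11 * 60 * 60, interval, inclusive_end=True)
    return {
        "Select 1": 1 * 1 * full_day,  # 1 city, 1 date
        "Select 2": 1 * 3 * daytime,   # 1 city, 3 dates, restricted time range
        "Select 3": 3 * 1 * full_day,  # 3 cities, 1 date
        "Select 4": 3 * 3 * full_day,  # 3 cities, 3 dates
    }


by_interval = {interval: returned_rows(interval) for interval in (20, 10, 5, 2, 1)}
```

Under these assumptions the results agree with the row counts on the horizontal axes of Fig. 8, for example 4320 rows for Select 1 and 5943 rows for Select 2 at a 20-second interval.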
6.2 Evaluation Results
In this evaluation, we measured the runtime from executing the first query to receiving
the last query response. Fig. 7 shows the results for the insert query. The horizontal
axis shows the number of consecutive insert queries, that is, the number of rows in the
table. The vertical axis shows the program execution time in milliseconds.
Fig. 7. The execution time of insert query
As can be seen, except for the smallest run of 100 k rows, insertion in Cassandra is
faster than in MySQL, owing to the use of the asynchronous method and to Cassandra's
storage structure, in which data is first stored in memory. The gap widens as the
amount of data grows.
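The size of the gap can be read off the values plotted in Fig. 7. The short Python calculation below only divides the reported MySQL times by the reported Cassandra times; the variable names are arbitrary and no new measurements are introduced.

```python
# Execution times in milliseconds as reported for Fig. 7.
row_counts = ["100 k", "500 k", "1 m", "2 m", "5 m"]
cassandra_ms = [8401, 20845, 35356, 71223, 189368]
mysql_ms = [7408, 36945, 74102, 154040, 380501]

# Ratio above 1 means Cassandra finished faster; difference is the absolute gap.
ratios = {n: m / c for n, c, m in zip(row_counts, cassandra_ms, mysql_ms)}
differences_ms = {n: m - c for n, c, m in zip(row_counts, cassandra_ms, mysql_ms)}

for label in row_counts:
    print(f"{label:>6}: MySQL / Cassandra = {ratios[label]:.2f}, "
          f"difference = {differences_ms[label]} ms")
```

The ratio is below 1 only for 100 k rows and stays close to 2 from 1 m rows onward, while the absolute difference keeps growing with the number of rows.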
We observed that MySQL uses only two CPU threads, at 70% utilization, whereas with
Cassandra all eight CPU threads are used at 92% utilization: because the application
uses the asynchronous method, it uses 4 or 5 threads, and the Cassandra database uses 4
threads. This means that with Cassandra we can take advantage of multi-threaded
programming and parallelization.
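The non-blocking pattern behind the asynchronous inserts can be illustrated with futures and completion callbacks. The sketch below uses Python's standard `concurrent.futures` module purely as a stand-in; it is not the Cassandra Java driver API, and `fake_insert`, `on_done` and the sample statements are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def fake_insert(statement):
    """Hypothetical stand-in for a database call; returns the statement on success."""
    if "fail" in statement:
        raise RuntimeError("insert rejected")
    return statement


succeeded, failed = [], []


def on_done(future):
    # Runs when the response is available; the submitting loop never waits for it.
    error = future.exception()
    if error is None:
        succeeded.append(future.result())
    else:
        failed.append(str(error))


statements = ["insert 1", "insert 2", "fail 3", "insert 4"]
with ThreadPoolExecutor(max_workers=4) as pool:
    for statement in statements:
        # submit returns immediately with a future; the callback handles the outcome.
        pool.submit(fake_insert, statement).add_done_callback(on_done)
# Leaving the with-block waits until all submitted work has finished.
```

The loop submits all statements without waiting for any response, which is what allows several threads to be busy at the same time.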
The charts in Fig. 8 show the runtime of the select queries. The horizontal axis shows
the number of rows returned in the query response and the vertical axis shows the
runtime in milliseconds.
Values plotted in Fig. 7 (execution time in milliseconds by row count):
Row count  100 k  500 k  1 m    2 m     5 m
Cassandra  8401   20845  35356  71223   189368
MySQL      7408   36945  74102  154040  380501
[Fig. 8 panels: Select 1, Select 2, Select 3 and Select 4. Each panel plots the runtime
in milliseconds of MySQL and Cassandra against the number of returned rows.
Returned rows, Select 1: 4320, 8640, 17280, 43200, 86400.
Returned rows, Select 2: 5943, 11883, 23763, 59403, 118803.
Returned rows, Select 3: 12960, 25920, 51840, 129600, 259200.
Returned rows, Select 4: 38880, 77760, 155520, 388800, 777600.]
Fig. 8. The execution time of select queries shown in Table. 1 on five different tables
To evaluate the select query, we first take a closer look at indexing in relational
databases. Relational databases use two types of index [14]: B-Tree and Hash. A B-Tree
index is used for queries whose comparisons contain the operators =, <, <=, >, >=,
BETWEEN and LIKE. The time to access data with this index is O(log n). A Hash index is
used only for queries that contain the operator = and is very fast, but it cannot be
used for range comparison operators. Also, the optimizer cannot use a hash index to
speed up ORDER BY operations [15]. In this research, we used the B-Tree index in MySQL
because the queries contain comparison operators.
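The difference between the two index types can be illustrated with standard Python containers. This is an analogy rather than MySQL's implementation: a sorted list searched with `bisect` stands in for the ordered B-Tree index, a `dict` stands in for the hash index, and the sample rows and function names are invented for the example.

```python
import bisect

# Sample rows (time, temperature), already sorted by time.
rows = [("06:00:00", 11), ("09:30:00", 15), ("12:00:00", 19),
        ("17:00:00", 16), ("21:00:00", 12)]

# Ordered index: a sorted list of keys stands in for a B-Tree.
ordered_keys = [time for time, _ in rows]

# Hash index: a dict from key to row position.
hash_index = {time: position for position, (time, _) in enumerate(rows)}


def range_lookup(low, high):
    """Rows with low <= time <= high, located by two binary searches (O(log n))."""
    start = bisect.bisect_left(ordered_keys, low)
    stop = bisect.bisect_right(ordered_keys, high)
    return rows[start:stop]


def equality_lookup(time):
    """Row with exactly this time, located by one hash probe (O(1) on average)."""
    position = hash_index.get(time)
    return None if position is None else rows[position]


daytime = range_lookup("06:00:00", "17:00:00")
noon = equality_lookup("12:00:00")
```

The hash structure has no notion of order, so a condition such as time >= '06:00:00' AND time <= '17:00:00' can only be answered efficiently by the ordered structure.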
Looking at the results of the select operations, Cassandra has a lower response time
than MySQL for Select 1 and 2. Cassandra reaches the data, which is divided into
partitions, by key in O(1), because it uses Bloom filters, indexes SSTable rows and
sorts rows by clustering columns within an SSTable, whereas MySQL accesses data in
O(log n) using the B-Tree index. This makes indexing more effective in Cassandra.
In Select 3 and 4, however, Cassandra is not better, because it has some limitations on
querying, one of which concerns the IN keyword. The IN condition is recommended only on
the last column of the partition key. Using IN can degrade performance because usually
many nodes must be queried. For example, in a single, local data center cluster with 30
nodes, a replication factor of 3 and a consistency level of LOCAL_QUORUM, a single-key
query goes out to two nodes, but if the query uses the IN condition, the number of
nodes being queried is most likely higher, up to 20 nodes depending on where the keys
fall in the token range.
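How an IN condition multiplies the work can be seen by enumerating the partition keys that each query in Table 1 addresses. With the partition key (city, date), every combination of a city and a date is a separate partition. The Python sketch below only counts these combinations; the function name `partitions_touched` is hypothetical, and the count says nothing about how many distinct nodes hold those partitions in a real cluster.

```python
from itertools import product


def partitions_touched(cities, dates):
    """Every (city, date) pair is a separate partition key that must be looked up."""
    return list(product(cities, dates))


# City and date lists taken from the queries in Table 1.
queries = {
    "Select 1": (["New York"], ["2016-01-04"]),
    "Select 2": (["Los Angeles"], ["2016-01-01", "2016-01-02", "2016-01-03"]),
    "Select 3": (["New Orleans", "Austin", "Chicago"], ["2016-01-03"]),
    "Select 4": (["Atlanta", "Boston", "New Orleans"],
                 ["2016-01-01", "2016-01-02", "2016-01-03"]),
}
fan_out = {name: len(partitions_touched(cities, dates))
           for name, (cities, dates) in queries.items()}
```

Select 1 addresses a single partition, Select 2 and Select 3 address three each, and Select 4 addresses nine. The count alone does not explain every measured difference, but it shows why a query with IN on both partition key columns is the most expensive of the four.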
7 Conclusions and future work
We presented a general overview of the Cassandra structure, including the distribution
of data and the handling of queries. Cassandra has a column family architecture that is
appropriate for time series data. This is clearly seen in the comparison between this
database and MySQL using the Java language and weather station data. We did not use the
distribution feature of Cassandra in this experiment, which could boost its
performance; we compared the performance of the two databases on one machine. As future
work, this feature of Cassandra can be considered. We will also evaluate other data
models using the properties of this database and make comparisons with other NoSQL
databases. We also want to compare the behavior of Cassandra and of a relational
database when multiple clients connect.
References
1. E. F. Codd: A Relational Model of Data for Large Shared Data Banks. Communications of
the ACM, 377-387 (1970).
2. Han Jing, E. Haihong, Guan Le, Jian Du: Survey on NoSQL database. In: 2011 6th
International Conference on Pervasive Computing and Applications (ICPCA), pp. 363-366.
IEEE, Port Elizabeth (2011).
3. Makris A, Tserpes K, Andronikou V, Anagnostopoulos D: A classification of NoSQL data
stores based on key design characteristics. Procedia Computer Science, 94-103 (2016).
4. Atzeni P, Bugiotti F, Cabibbo L, Torlone R: Data modeling in the NoSQL world. Computer
Standards & Interfaces (2016).
5. Kabakus, Abdullah Talha, and Resul Kara: A performance evaluation of in-memory
databases. Journal of King Saud University-Computer and Information Sciences, 520-525
(2017).
6. Van der Veen, J.S., Van der Waaij, B. and Meijer, R.J: Sensor data storage performance:
SQL or NoSQL, physical or virtual. In: 2012 IEEE 5th International Conference on Cloud
Computing (CLOUD), pp. 431-438. IEEE, Honolulu (2012).
7. Lee, K. K. Y., Tang, W. C., & Choi, K. S: Alternatives to relational database: comparison
of NoSQL and XML approaches for clinical data storage. Computer Methods and Programs
in Biomedicine, 99-109 (2013).
8. Column family, Wikipedia, https://en.wikipedia.org/wiki/Column_family, last accessed
2018/07/27.
9. Jeff Carpenter, Eben Hewitt: Cassandra: The Definitive Guide. 2nd edn. O'Reilly,
California (2016).
10. Getting Started with Time Series Data Modeling,
https://academy.datastax.com/resources/getting-started-time-series-data-modeling,
last accessed 2018/07/27.
11. Robbie Strickland: Cassandra 3.x High Availability. 2nd edn. Packt Publishing,
Birmingham (2016).
12. Nishant Neeraj: Mastering Apache Cassandra. 2nd edn. Packt Publishing, Birmingham
(2015).
13. Nitin Padalia: Apache Cassandra Essentials. Packt Publishing, Birmingham (2015).
14. Abraham Silberschatz, Henry F. Korth, S. Sudarshan: Database System Concepts. 6th edn.
McGraw-Hill, New York (2010).
15. Comparison of B-Tree and Hash Indexes,
https://dev.mysql.com/doc/refman/5.5/en/index-btree-hash.html, last accessed
2018/07/27.