Towards a representative benchmark for time series databases

CONFIDENTIAL UP TO AND INCLUDING 03/01/2017 - DO NOT COPY, DISTRIBUTE OR MAKE PUBLIC IN ANY WAY

Thomas Toye
Student number: 01610806

Master's dissertation submitted in order to obtain the academic degree of
Master of Science in de industriële wetenschappen: elektronica-ICT

Supervisors: Prof. dr. Bruno Volckaert, Prof. dr. ir. Filip De Turck
Counsellors: Dr. ir. Joachim Nielandt, Jasper Vaneessen

Academic year 2018-2019
Preface
I would like to thank my supervisors, Prof. dr. Bruno Volckaert and Prof. dr. ir.
Filip De Turck.
I am very grateful for the help and guidance of my counsellors, Dr. ir. Joachim
Nielandt and Jasper Vaneessen.
I would also like to thank my parents for their support, not only during the writing
of this dissertation, but also during my transition programme and my master’s.
The author gives permission to make this master dissertation available for consul-
tation and to copy parts of this master dissertation for personal use. In all cases
of other use, the copyright terms have to be respected, in particular with regard to
the obligation to state explicitly the source when quoting results from this master
dissertation.
Thomas Toye, June 2019
Towards a representative benchmark
for time series databases
Thomas Toye
Master’s dissertation submitted in order to obtain the academic degree of
Master of Science in de industriële wetenschappen:
elektronica-ICT
Academic year 2018–2019
Supervisors: Prof. dr. Bruno Volckaert, Prof. dr. ir. Filip De Turck
Counsellors: Dr. ir. Joachim Nielandt, Jasper Vaneessen
Summary
As the fastest growing database type, time series databases (TSDBs) have experienced a rise in database vendors, and with it, a rise in difficulty in selecting the best one. TSDB benchmarks compare the performance of different databases to each other, but the workloads they use are not representative: they use random data, or synthesized data that is only applicable to one domain. This dissertation argues that these non-representative benchmarks may not always accurately model real-world performance, and that representative workloads should be used in TSDB benchmarks instead. In this context, workloads are defined as consisting of data sets and queries. Workload data sets can be categorized using eight parameters (number of metrics, regularity, volume, data type, number of tags, tag value data type, tag value cardinality, variation). A new benchmark was created, which uses three representative workloads next to a baseline non-representative workload. Results of this benchmark show significant performance differences between representative and non-representative workloads for data ingestion speed for complex data, latency and maximum request rate (when broad time ranges are used), and storage efficiency of data points. The results show that existing benchmarks may not accurately reflect real-world performance.
Keywords
Time series database, representative benchmarking, load testing
Towards a representative benchmark for time series databases
Thomas Toye
Supervisor(s): Bruno Volckaert, Filip De Turck
Abstract— As the fastest growing database type, time series databases (TSDBs) have experienced a rise in database vendors, and with it, a rise in difficulty in selecting the best one. TSDB benchmarks compare the performance of different databases to each other, but the workloads they use are not representative: they use random data, or synthesized data that is only applicable to one domain. We argue that these non-representative benchmarks may not always accurately model real-world performance, and that representative workloads should be used in TSDB benchmarks instead. In this context, workloads are defined as consisting of data sets and queries. Workload data sets can be categorized using eight parameters (number of metrics, regularity, volume, data type, number of tags, tag value data type, tag value cardinality, variation).
A new benchmark was created, which uses three representative workloads next to a baseline non-representative workload. Results of this benchmark show significant performance differences for data ingestion speed for complex data, latency and maximum request rate (when broad time ranges are used), and storage efficiency of data points when comparing representative and non-representative workloads. The results show that existing benchmarks may not accurately reflect real-world performance.
Keywords— Time series database, representative benchmarking, load testing
I. INTRODUCTION
TIME SERIES DATABASES provide storage and interfacing for time series. In its simplest form, time series data are just data with an attached timestamp. This subtype of data has seen increasing interest in the last decade, especially with the rise of the Internet of Things, which produces time series for everything from temperature to sea levels. Other areas where time series are used are the financial industry (e.g. historical analysis of stock performance), the DevOps industry (e.g. capture of metrics from a server fleet) and the analytics industry (e.g. tracking ad performance over time).
Finding the best database to use is not an easy task. Eighty-three existing TSDBs were found by Bader et al. [1]. To determine the best one, benchmarks are used. However, these benchmarks may not be representative of the use case or industry the TSDB is needed for, which makes their results difficult to generalize.
In this abstract, we will first analyze existing TSDB benchmarks. Then, a new benchmark is proposed, which compares representative workloads to non-representative workloads. Finally, the results of this benchmark are analysed to determine whether the results of non-representative benchmarks generalize to real-world performance.
II. EVALUATION OF EXISTING BENCHMARKS
Chen et al. [2] consolidate the properties of a good benchmark as follows: 1. Representative: Benchmarks must simulate real-world conditions; both the input to a system and the system itself should be representative of real-world usage. 2. Relevant: Benchmarks must measure relevant metrics and technologies. Results should be useful to compare widely-used solutions. 3. Portable: Benchmarks should provide a fair comparison by being easily extensible to competing solutions that solve comparable problems. 4. Scalable: Benchmarks must be able to measure performance at a wide range of scales: not just single-node performance, but also cluster configurations. 5. Verifiable: Benchmarks should be repeatable and independently verifiable. 6. Simple: Benchmarks must be easily understandable, while making choices that do not affect performance.
Existing TSDB benchmarks were evaluated; a summary is shown in Table I. Two gaps in the state of the art are clear: current benchmarks insufficiently test TSDB performance at scale, and current benchmarks are not representative or only representative for a single use case. The data used is either random or synthetic; real-world data are not used. This begs the question: are results of a non-representative benchmark generalizable to real-world performance?
Benchmark             Representative            Relevant  Portable  Scalable  Verifiable  Simple
TS-Benchmark          For IoT use cases         ✓         ✓         ✗         ✓           ✓
IoTDB-benchmark       ✗                         ✓         ✓         ✗         ✓           ✓
TSDBBench             ✗                         ✓         ✓         ✓         ✓           ✗
FinTime               For financial use cases   ✓         ✓         ✗         ✗           ✗
influxdb-comparisons  For DevOps use cases      ✓         ✓         ✗         ✓           ✓

TABLE I
EVALUATION OF EXISTING TSDB BENCHMARKS (✓ = satisfied, ✗ = not satisfied)
III. BENCHMARK COMPONENTS
A new benchmark is developed to compare benchmark performance between representative and non-representative workloads. Workloads consist of a workload data set that is loaded into the TSDB and a workload query set that executes upon it.
A. Data set
Time series data sets have the following properties in common: data arrives in order, updates are very rare to non-existent, deletion is rare, and data values follow a pattern. They differ on the following characteristics:
• Metrics: Data points are organized in metrics, which can be compared to tables in relational databases.
• Regularity: In regular time series, data points are spaced evenly in time. Irregular time series do not emit data points regularly; they are often the result of event triggers.
• Volume: High volume time series may emit hundreds of thousands of data points a second, while low volume time series only emit one event a day.
• Data type: Traditionally, values of data points in a time series have been integers or floating point numbers. But they can also be booleans, strings or even custom data types.
• Tags: A time series data point may have one or more tags associated with the timestamp and value. There may be no tags or many tags. Tags may hold special values, such as geospatial information.
• Tag value cardinality: The number of possible combinations the tag values make. Three tags with two possible values each make a tag value cardinality of eight.
• Variation: While time series data usually follow a pattern, the variation in a series may be very different. One series may describe a flat line, while another may describe seasonal variations with daily spikes.
B. Query set
Bader et al. describe ten distinct TSDB query capabilities in [1]. These building blocks (e.g. update, delete, select from a time range) can form time series queries (e.g. select the mean of temperature values from last year, aggregated by day). Next to the queries themselves, their relative frequency is an important part of the query set.
C. Measurement characteristics
Measurement characteristics describe the performance metrics that are monitored to quantify performance. For TSDB benchmarks, common metrics include response latency (mean, 95th and 99th percentile, etc.), response size, data ingestion speed, and storage efficiency.
IV. A REPRESENTATIVE BENCHMARK
A benchmark was created with representativeness as its design goal. It compares three representative workloads to a non-representative workload to investigate possible performance differences. Three real-world data sets, from domains in which TSDBs are prevalent, are used, next to a baseline. The baseline is a non-representative data set, with random values and tags. For every data set, twenty queries are written, relevant to the data set’s domain (e.g. getting the average rating for a movie in the ratings data set), except for the baseline, for which a single query is used. Vegeta [3] was used to capture response latency (mean and 95th percentile), response times, and response size. The http_load program [4] was used for load testing. Standard UNIX tools were used for storage efficiency analysis. Four TSDBs are tested: InfluxDB, OpenTSDB, KairosDB with Cassandra as a backing database, and KairosDB with ScyllaDB. These are modern, open source databases with an HTTP interface.
Table II shows an overview of the data sets used. The baseline is a data set with random values and tags, the financial data set uses historical stock market information, the rating data set uses movie reviews and the IoT data set is produced by power information for a house.

                        Baseline   Financial  Rating     IoT
Metrics                 1          6          1          7
Regularity              Regular    Semi-reg.  Irregular  Regular
Volume                  Low        Low        Low        Low
Tags                    2          1          5          0
Tag value cardinality   10,000     7,164      20M        0
Variation               High       Low        High       Low
Total data points       20M        74.4M      20M        14.5M
License                 NA         CC0        Custom     CC-BY-4

TABLE II
OVERVIEW OF WORKLOAD DATA SETS
V. EVALUATION
A. Storage efficiency
Figure 1 shows relative storage efficiency. The size in bytes per data point was compared to the size per data point in the comma separated value (CSV) source. The input size was one million data points for every data set. It shows that representative data sets have different storage efficiency than the reference. OpenTSDB is better at storing real-world data sets than synthesized data, InfluxDB much worse. Tag value cardinality and data point value variation are thought to have a high impact on storage efficiency.
Fig. 1. Relative storage efficiency of different TSDBs per data point compared to the CSV source format (CSV = 1). Databases shown: InfluxDB, OpenTSDB, KairosDB-Cassandra, KairosDB-ScyllaDB.
B. Data ingestion throughput
For every data set, one million data points were loaded into each TSDB and ingestion speed was measured (in data points per second). The results are shown in Figure 2. For the representative ratings workload, performance is degraded, especially for InfluxDB. This is a data set with high tag cardinality and complex tag values.
Fig. 2. Data points ingested per second for InfluxDB, OpenTSDB, KairosDB-Cassandra and KairosDB-ScyllaDB. Data sets used were one million data points each.
C. Load testing
Figure 3 shows results of the load test. The results for OpenTSDB are surprising: it performed well for the baseline and IoT query workloads, but not for the financial and ratings query workloads. For the latter two workloads, the time ranges are very broad, so the database has to scan more data. The other TSDBs may be able to optimize this operation better.
Fig. 3. Maximum requests per second for InfluxDB, OpenTSDB, KairosDB-Cassandra and KairosDB-ScyllaDB. Tests were performed on data sets one million data points in size.
D. Response latency
Figure 4 shows the mean response latency when using a representative query set. A performance degradation for OpenTSDB surfaces for the financial and ratings query workloads, which use broad time ranges. Otherwise, the baseline is a good predictor for relative performance in the representative benchmarks. This is attributed to the same cause as in Section V-C.
Fig. 4. Mean latency per request for InfluxDB, OpenTSDB, KairosDB-Cassandra and KairosDB-ScyllaDB.
E. Response size
Figure 5 shows the mean response size of TSDBs in bytes. The mean response size is correlated with the data set. The size differences for large responses (e.g. the financial workload) can be attributed mainly to timestamp encoding.
Fig. 5. Mean size in bytes of the TSDB response for InfluxDB, OpenTSDB, KairosDB-Cassandra and KairosDB-ScyllaDB.
VI. CONCLUSIONS
Compared to a baseline non-representative workload, representative workloads showed significant performance differences when it came to storage efficiency, data ingestion speed for complex data, latency and maximum request rate (when broad time ranges are used). Existing TSDB benchmarks do not use representative workloads, thus their relevance may be called into question.
The fact that not all representative workloads show performance impact highlights the importance of using multiple representative workloads for general TSDB benchmarks: just one representative workload may not be enough to highlight possible deviations or performance degradations.
It is impractical to create a representative workload for every domain, but TSDB workloads can be characterized by workload parameters. Further research is needed to determine whether these parameters are enough to accurately describe a TSDB workload and thus generalize results of one workload to another with the same workload parameters.
REFERENCES
[1] Andreas Bader, Oliver Kopp, Michael Falkenthal, Survey and Comparison of Open Source Time Series Databases, Datenbanksysteme für Business, Technologie und Web (BTW 2017) – Workshopband.
[2] Yanpei Chen, Francois Raab, Randy Katz, From TPC-C to Big Data Benchmarks: A Functional Workload Model, Specifying Big Data Benchmarks (WBDB 2012), Lecture Notes in Computer Science, vol. 8163, Springer, Berlin, Heidelberg.
[3] Tomas Senart, Vegeta – HTTP load testing tool and library,https://github.com/tsenart/vegeta
[4] Jef Poskanzer, http_load, https://acme.com/software/http_load/
Contents
Preface iv
Abstract v
Extended abstract vi
Table of Contents ix
1 Introduction 1
2 Literature review 2
2.1 Databases . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 2
2.1.1 Database Management Systems . . . . . . . . . . . . . . . . 2
2.1.2 Relational databases . . . . . . . . . . . . . . . . . . . . . . 2
2.1.3 Non-relational databases . . . . . . . . . . . . . . . . . . . . 3
2.1.4 NewSQL databases . . . . . . . . . . . . . . . . . . . . . . . 4
2.1.5 Time series databases . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Time series database benchmarks . . . . . . . . . . . . . . . . . . . 4
2.2.1 TS-Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2.2 IoTDB-benchmark . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.3 TSDBBench . . . . . . . . . . . . . . . . . . . . . . . . . . . 6
2.2.4 FinTime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.2.5 influxdb-comparisons . . . . . . . . . . . . . . . . . . . . . . 7
2.2.6 STAC-M3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3 Data sets . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
3 State of the art 10
3.1 Uses of time series databases . . . . . . . . . . . . . . . . . . . . . . 10
3.1.1 TSDB usage as a data store . . . . . . . . . . . . . . . . . . 10
3.1.2 Inherent time series database functions used . . . . . . . . . 11
3.1.3 Common characteristics of time series data . . . . . . . . . . 12
3.1.4 Differing characteristics of time series data . . . . . . . . . . 12
3.1.5 Industry use cases . . . . . . . . . . . . . . . . . . . . . . . . 13
3.2 A “good” benchmark . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3 Existing benchmarks . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.3.1 TS-Benchmark . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.3.2 IoTDB-benchmark . . . . . . . . . . . . . . . . . . . . . . . 17
3.3.3 TSDBBench/YCSB-TS . . . . . . . . . . . . . . . . . . . . . 18
3.3.4 FinTime . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 19
3.3.5 influxdb-comparisons . . . . . . . . . . . . . . . . . . . . . . 19
3.4 Evaluation of existing benchmarks . . . . . . . . . . . . . . . . . . . 20
3.4.1 On scalability . . . . . . . . . . . . . . . . . . . . . . . . . . 20
3.4.2 On representativeness . . . . . . . . . . . . . . . . . . . . . . 21
3.5 Contribution . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 21
4 A new benchmark 23
4.1 Benchmark components . . . . . . . . . . . . . . . . . . . . . . . . 23
4.1.1 Workload data set characteristics . . . . . . . . . . . . . . . 23
4.1.2 Workload query characteristics . . . . . . . . . . . . . . . . 24
4.1.3 Measurement characteristics . . . . . . . . . . . . . . . . . . 24
4.2 Design of a representative data workload . . . . . . . . . . . . . . . 25
4.2.1 A baseline workload . . . . . . . . . . . . . . . . . . . . . . 25
4.2.2 A financial time series workload . . . . . . . . . . . . . . . . 26
4.2.3 A rating system workload . . . . . . . . . . . . . . . . . . . 27
4.2.4 An IoT workload . . . . . . . . . . . . . . . . . . . . . . . . 28
4.2.5 Workload data set overview . . . . . . . . . . . . . . . . . . 29
4.2.6 Data set pre-processing . . . . . . . . . . . . . . . . . . . . . 29
4.3 Design of a representative query workload . . . . . . . . . . . . . . 30
4.3.1 Queries for the baseline workload . . . . . . . . . . . . . . . 30
4.3.2 Queries for the financial workload . . . . . . . . . . . . . . . 31
4.3.3 Queries for the rating workload . . . . . . . . . . . . . . . . 31
4.3.4 Queries for the IoT workload . . . . . . . . . . . . . . . . . 32
4.4 Metrics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
4.5 Technical implementation . . . . . . . . . . . . . . . . . . . . . . . 33
4.5.1 Test environment . . . . . . . . . . . . . . . . . . . . . . . . 33
4.5.2 Data ingestion . . . . . . . . . . . . . . . . . . . . . . . . . . 33
4.5.3 Load and latency testing . . . . . . . . . . . . . . . . . . . . 34
4.6 Design evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
5 Results 36
5.1 Storage efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 36
5.2 Data ingestion throughput . . . . . . . . . . . . . . . . . . . . . . . 39
5.3 Load testing with query workload . . . . . . . . . . . . . . . . . . . 40
5.4 Response latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.5 Mean response size . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
5.6 Evaluation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
6 Conclusions and future work 48
6.1 Conclusions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 48
6.2 Future work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 49
A Detailed results 51
A.1 Data ingestion throughput . . . . . . . . . . . . . . . . . . . . . . . 51
A.2 Storage efficiency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 51
A.3 Load testing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
A.4 Response latency . . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
A.5 Mean response size . . . . . . . . . . . . . . . . . . . . . . . . . . . 52
Bibliography 54
List of Abbreviations 57
List of Figures 59
List of Tables 60
Chapter 1
Introduction
Time series databases provide storage and interfacing for time series. In its simplest
form, time series data are just data with an attached timestamp. This subtype of
data has seen increasing interest in the last decade, especially with the rise of the
Internet of Things, which produces time series for everything from temperature
to sea levels. Other areas where time series are used are the financial industry
(e.g. historical analysis of stock performance), the DevOps industry (e.g. cap-
ture of metrics from a server fleet) and the analytics industry (e.g. tracking ad
performance over time).
Time Series Databases (TSDBs) are the fastest growing type of databases. When
selecting a TSDB, performance is one of the main considerations. Comparing
database performance is done using benchmarks, and for TSDBs, a number of
benchmarks already exist. However, these all use either random data or synthetic
data. Moreover, TSDBs have a wide range of applications, and representative
synthesized data is only valid for one domain. Thus, the data used for benchmarks is
either non-representative, or only representative for one use case or industry. Can
the results of performance tests with random or generated data be generalized to
the real world?
In this dissertation, we will first analyze existing TSDB benchmarks. Then, properties
of time series data sets are analyzed. Finally, a new benchmark is proposed, which
compares representative workloads to non-representative workloads.
Chapter 2
Literature review
2.1 Databases
A database is a set of data, organized in a form that makes it easy to process.
2.1.1 Database Management Systems
A Database Management System (DBMS) is an application for management of
databases. Apart from the creation and deletion of databases, a DBMS allows
create, read, update and delete (CRUD) operations on these databases.
A database is the data itself and how it is organized. The term “database” is often
used instead of “DBMS”. In this dissertation, the two are used interchangeably.
2.1.2 Relational databases
Edgar Codd introduced the relational model in 1970 [1]. Relational databases
use this model to store data: records are represented as rows, attributes of these
records are organized in columns, and the data itself in tables. A relational DBMS
(RDBMS) will most often use Structured Query Language (SQL) for data retrieval
and manipulation.
2.1.3 Non-relational databases
As applications began to scale, companies started moving away from traditional
RDBMSs for the following reasons [2]:
• In traditional DBMSs, the focus on correctness leads to degraded perfor-
mance.
• The relational model was thought not to be the best way to store data.
• The DBMSs were often used as simple data stores. A full-blown DBMS was
overkill for such use cases.
These factors caused a move to so-called “NoSQL” databases. The term used to
refer to databases that do away with the relational structure of RDBMSs, but has
taken on the meaning of “Not only SQL” [3]. Cattell [4] identifies six key features
of NoSQL DBMSs:
1. Horizontal scalability
2. Replication and partition of data over many machines
3. Simple interface (relative to SQL)
4. Weaker concurrency model (compared to ACID nature of relational DBMSs)
5. Distributed indexes used for data storage
6. Able to add new attributes to existing data
NoSQL databases generally do away with the strict correctness guarantees found in relational
databases. For example, transactions may not be available in NoSQL DBMSs, or
writes may take a while to propagate and show up in reads.
2.1.4 NewSQL databases
NewSQL databases try to bridge RDBMS and NoSQL DBMS differences by bring-
ing relational semantics to NoSQL DBMSs [3]. The aim is to have the best of both
worlds: the relational model of RDBMSs and the scalability and fault tolerance of
NoSQL DBMSs.
2.1.5 Time series databases
Time series databases (TSDBs) are databases optimised for storing time series.
Time series are represented in these databases as data points with a value, a
timestamp, and metadata, such as a metric name, tags, and geospatial information.
Time series databases can be relational (e.g. Timescale, a NewSQL DBMS) or
non-relational (e.g. InfluxDB, a NoSQL DBMS) databases.
Bader et al. [5] identified 75 TSDBs, of which 42 are open source and 33 are
proprietary.
2.2 Time series database benchmarks
There are a number of existing benchmarks tailored to TSDBs. This is a recent
development: most of these benchmarks were developed less than three years ago.
2.2.1 TS-Benchmark
TS-Benchmark is a benchmark specifically developed for TSDBs by Chen at the
Renmin University of China in December 2018. A new benchmark was modelled
based on a wind farm scenario: sensor data are appended and queried [6].
Databases tested in this benchmark are InfluxDB, IotDB, TimescaleDB, Druid,
and OpenTSDB. The benchmark is written in Java and uses no external depen-
dencies or frameworks.
Apart from a presentation, not much information is available on TS-Benchmark.
Metrics measured by TS-Benchmark:
• Load performance: The ingestion speed of the TSDB, which is measured in
points loaded per second.
• Throughput performance: New data points are appended to an existing time
series (measured in points appended per second).
• Query performance: For both simple aggregation queries and time range
queries, read queries are performed and two measurements are made: requests per
second and average response time.
• Stress test: Two stress tests are performed. In the first, data points are
appended while a constant number of queries are run (performance measured
in points appended per second). In the second, queries are run while a
constant number of data points are appended (performance measured in
requests per second and average response time).
Load performance is different from throughput performance. The former measures
the importing of a big data set into the database, while the latter measures ap-
pending points in real-time. It is unclear if the benchmark uses special facilities
to test load performance (e.g. bulk or batch functionality from the TSDB) or if
importing is needed to test read queries.
2.2.2 IoTDB-benchmark
In a preprint paper on arXiv, Liu and Yuan describe IoTDB-benchmark [7]. The
features that set this benchmark apart from basic benchmarks are generation of
out-of-order data, measurement of system resources in addition to database performance
metrics, and simulation of real-world conditions by running heterogeneous queries
concurrently. IoTDB-benchmark is written in Java.
IoTDB-benchmark has ten types of queries, ranging from “latest data point” to
“time range query with value filter”. InfluxDB, OpenTSDB, KairosDB, and
TimescaleDB are targeted by IoTDB-benchmark. The benchmark also supports
Cloud Time Series Database (CTSDB), a TSDB created by Tencent Cloud¹, but
this is not mentioned in the paper.
Metrics measured by IoTDB-benchmark:
• Query latency: Statistical metrics, such as average, maximum, 95th per-
centile, etc. are calculated on the time the ten supported query types take.
• Throughput performance: Data points appended to an existing time
series, measured in points appended per second.
• Space consumption: The used disk space is measured.
• System resources: System resources, such as CPU time, network, mem-
ory and I/O usage are measured.
2.2.3 TSDBBench
TSDBBench was created by Bader as part of his dissertation in 2016. It extends
the Yahoo! Cloud Serving Benchmark (YCSB) for use with time series databases in
a project called YCSB-TS. TSDBBench includes YCSB-TS, the benchmark itself,
and Overlord, a provisioning system written in Python that sets up databases to
test [5].
In practice, the benchmark seems unmaintained. The documentation is out of
date, necessary files are hosted on a defunct domain, and the database versions
tested are several years old.
Ten types of queries are supported, such as “insert”, “update”, “scan” and “sum”.
TSDBBench supports eighteen databases, which is the most of any TSDB bench-
mark.
Metrics measured by TSDBBench:
¹ Not much documentation on CTSDB is available, and all of it is in Chinese.
• Query latency: Statistical metrics, such as average, maximum, 95th per-
centile, etc., are calculated on the time the ten supported query types take.
• Space consumption: The used disk space is measured.
2.2.4 FinTime
FinTime was developed in 1999. It is not written in a specific language: FinTime
is merely a description of a benchmark. The benchmark describes two models,
including data model, queries, and operational characteristics [8]. They contain
nine queries run by five clients at once, and six queries run by fifty clients at once,
respectively.
Metrics measured by FinTime:
• Query latency (defined as “Response Time Metric”): The geometric mean
of query latencies.
• Throughput Metric: Average time that a complete set of queries takes.
Every set (nine queries for the first model, six for the second) represents a
user.
• Cost metric: Defined as (R × T) / TC, where R is the response time metric, T is
the throughput metric, and TC is the total cost of the system in USD. This
metric provides insight into the cost-effectiveness of a system.
2.2.5 influxdb-comparisons
The project influxdb-comparisons is created by InfluxData, the company that
develops InfluxDB. It compares the InfluxDB TSDB to other databases. The
project is written in Go and was started in 2016.
At this moment, the benchmark supports InfluxDB, Elasticsearch, Cassandra,
MongoDB and OpenTSDB.
Metrics measured by influxdb-comparisons:
• Space consumption: After batch loading data, disk usage is measured.
• Load performance: Measured in time taken to load the data and average
ingestion rate.
• Query performance: Measured in queries per second.
2.2.6 STAC-M3
STAC-M3 is a closed-source benchmark that measures performance of TSDB
stacks, focused on high-speed applications. The publications, specification, and ap-
plication itself are only accessible to Securities Technology Analysis Center (STAC)
members.
At the moment, only results for the kdb+ database have been published publicly.
The following metrics are measured:
• Storage efficiency: The size of the original data set divided by the size of
the database.
• Mean and maximum response times for a variety of scenarios. For most
scenarios, minimum and median response times are also reported, as well as
the standard deviation.
2.3 Data sets
To study and create benchmarks for TSDBs, it is important to understand the
fields where time series are recorded and analyzed. Six existing repositories of
time series data sets were discovered.
Dau et al. maintain a repository of 128 time series data sets for data mining and machine learning purposes [9]. The data sets range from electricity usage to accelerometer data of performed gestures. Every data set is cleaned and documented.
The Center for Machine Learning and Intelligent Systems at the University of
California maintains a database of data sets for use with machine learning [10].
Ninety-two time series data sets are currently in their repository, with domains
ranging from stress detection and retail to electricity consumption and parking
occupancy rates.
Hyndman created the Time Series Data Library (TSDL), which contains about
eight hundred time series data sets [11]. TSDL spans many domains, from hydrol-
ogy and finance to crime and physics.
A “data catalog start-up” called data.world currently has thirty-four time series data sets in its repository [12]. The data sets are mostly governmental statistics, such as crime data and pollution indexes.
On Kaggle, 238 data sets show up when searching for time series databases. These
data sets are contributed by different authors.
Leskovec and Krevl maintain the Stanford Network Analysis Project (SNAP) data
sets [13]. These data sets are often graphs, but the online reviews and online
communities data sets contain time series data.
Chapter 3
State of the art
In this chapter, the various uses of time series databases will be examined. Then, existing benchmarks are evaluated and gaps in the state of the art are identified.
3.1 Uses of time series databases
3.1.1 TSDB usage as a data store
Some use cases do not exploit the full potential of time series databases; they merely use a time series database as a data store for time-coupled data. While the data could be stored in another data store, using a time series database offers clear advantages:
• Compression: Since time series data arrives mostly in-order, high compres-
sion ratios can be achieved efficiently with delta coding, or more advanced
compression algorithms, such as SPRINTZ [14].
• Scalability: Most modern time series databases come with scalability built-
in, removing the need to worry about data migration when applications
become bigger or more data-intensive.
• Usage of inherent time functions when needed: Even if an application
makes no use of time series functions, they could do so at a later time, without
the need for data migration. This also holds true for arbitrary queries: when
engineers want to run time-based arbitrary queries, they can do so without
data transformation or migration.
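Delta coding exploits exactly this in-order arrival: instead of storing full timestamps, only the small differences between consecutive points are stored. A minimal sketch, illustrative only and not the scheme of any particular TSDB:

```python
from typing import List

def delta_encode(timestamps: List[int]) -> List[int]:
    """Store the first timestamp verbatim, then only the (small)
    differences between consecutive timestamps."""
    if not timestamps:
        return []
    return [timestamps[0]] + [b - a for a, b in zip(timestamps, timestamps[1:])]

def delta_decode(deltas: List[int]) -> List[int]:
    """Reverse the encoding by accumulating the differences."""
    out: List[int] = []
    acc = 0
    for i, d in enumerate(deltas):
        acc = d if i == 0 else acc + d
        out.append(acc)
    return out

ts = [1560000000, 1560000010, 1560000020, 1560000031]
print(delta_encode(ts))  # [1560000000, 10, 10, 11]
assert delta_decode(delta_encode(ts)) == ts
```

The small deltas can then be packed into few bits or fed to a general-purpose compressor, which is where the high compression ratios come from.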
Anomaly detection, forecasting and prediction are examples that usually use the
time series database as a data store: a separate application provides the processing.
3.1.2 Inherent time series database functions used
Most TSDBs are not simple data stores, but provide specialised functions to handle time series analysis and aggregation. Bader et al. [5] describe the following time series database capabilities:
• INS: Insertion of a single data point
• UPDATE: Update of one or more data points with a certain timestamp
• READ: Retrieval of one or more data points with a certain timestamp
• SCAN: Retrieval of rows in a timestamp range
• AVG: Calculates the average value in a time range
• SUM: Calculates the sum of values in a time range
• CNT: Counts the number of data points with a certain timestamp
• DEL: Deletes data points with a certain timestamp
• MAX: Calculates the maximum value in a time range
• MIN: Calculates the minimum value in a time range
Functions that calculate a value, such as SUM, can be aggregated in time peri-
ods. Time series databases provide first-class support for queries like “average of
temperature grouped in blocks of 7 minutes” and “highest CPU usage for every
hour”.
Visualisation is an example that relies heavily on these features. To provide users
with flexible visualisation options, the database needs to support, or at least facil-
itate, the above functions.
3.1.3 Common characteristics of time series data
While time series are used in different industries for a variety of use cases, in
general, time series data have the following characteristics:
• In-order data arrival: Data will, with rare exceptions, arrive with ascend-
ing time stamps.
• Updates are non-existent: Changing data points are rare and not part
of normal operations.
• Deletion is rare: It is uncommon for individual data points to be deleted, but large numbers of data points may be removed at once, for example as part of a retention policy.
• TSDB-specific functions may be heavily used, depending on the ap-
plication.
• Data values follow a pattern: There might be trends, and seasonal and non-seasonal cycles. It is rare for time series data to be completely random.
3.1.4 Differing characteristics of time series data
While time series data have general characteristics, series may diverge on the
following properties:
• Regularity: In regular time series, data points are spaced evenly in time.
Irregular time series do not emit data points regularly. Irregular time series
are often the result of event triggers.
• Volume: High volume time series may emit hundreds of thousands of data points a second, while low volume time series may emit only one event a day.
• Data type: Traditionally, values of data points in a time series have been
integers or floating point numbers. But they can also be booleans, strings or
even custom data types.
• Tags: A time series data point may have zero or more tags associated with the timestamp and value; some series have no tags, others have many. Tags may hold special values, such as geospatial information.
• Tag value cardinality: The number of possible combinations the tag values make. Three tags with two possible values each make a tag value cardinality of eight.
• Variation: While time series data usually follow a pattern, the variation
in a series may be very different. One series may describe a flat line, while
another may describe seasonal variations with daily spikes.
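Tag value cardinality as defined here is the product of the number of possible values of each tag. A small illustrative helper (the function name is ours):

```python
from math import prod

def tag_value_cardinality(values_per_tag):
    """Number of possible distinct tag combinations: the product of
    the number of possible values of each tag."""
    return prod(values_per_tag)

print(tag_value_cardinality([100, 100]))  # 10000 (two tags, 100 values each)
print(tag_value_cardinality([2, 2, 2]))   # 8 (three tags, two values each)
```

Note that in real data sets the observed cardinality may be lower than this upper bound, since not every combination necessarily occurs.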
3.1.5 Industry use cases
Internet of Things and sensor data
The Internet of Things revolution has made it possible to connect devices to the
internet that were previously only available as offline systems. These devices can
be split up in two categories: actuators, to which commands can be sent to perform
an action, and sensors, which sense the current environment and translate physical
quantities into digital values.
The values sent from these sensors and the usual analyses performed upon them
are a natural fit for time series databases. Every data point generated by a sensor
is associated with a timestamp (the time at which it was produced). The frequency
of data generation depends on the application domain; common intervals are every minute, every ten minutes and every hour.
Common operations on sensor data include getting the most recent data points, averaging data points over time intervals and flexible visualisation. IoT data sets are usually regular, low volume for small numbers of sensors, and often make use of geospatial tags.
Financial
Time series have long been a subject of study in financial disciplines. Stock in-
formation, exchange rates and portfolio valuations can all be represented as time
series, thus a time series database is a logical choice to store financial data points.
For example, kdb+, a time series database developed by Kx Systems, is often used
in high-frequency trading. kdb+ also explicitly presents other financial use cases,
such as algorithmic trading, forex trading, and regulatory management.
Financial time series are regular, but differ greatly in volume. Data points may be produced anywhere from once a day (e.g. stock closing prices) to every few milliseconds (e.g. high-frequency trading).
DevOps and machine monitoring applications
In the operations and DevOps industries, TSDBs are used extensively to monitor
computer systems and software applications. Common metrics include processor
load, memory usage and application response times. Metrics are usually aggre-
gated on the device they are collected from in one minute intervals before being
sent to a metrics collector.
The collected data are used for manual analysis (e.g. “What is the slowest com-
ponent in our stack?”), alerting (e.g. sending an alert when the average load is
above 90% for more than 5 minutes) and automatic anomaly detection.
Software monitoring and DevOps use cases produce regular time series that are low volume for small numbers of machines and applications.
Asset tracking
Apart from software applications, time series databases are also often used to monitor physical systems. Most time series databases include support for storing and querying spatial data, making it possible to associate location data with each data point.
Use cases include asset tracking (e.g. storing current location of vehicles at a point
in time) and geographical filtering (e.g. average of temperature for sensors within
a range).
Asset tracking use cases produce data points with geospatial information. The time series produced can be regular (e.g. a location is sent every minute), but are often irregular. Since asset tracking involves following entities over a large geographical area or in rough terrain, connectivity may be limited. This means accurately determining and transmitting positions may be impacted, resulting in irregular time series.
Analytics
In analytics, time series may be used to monitor website visits, advertisement
clicks, or E-commerce orders.
Time series are used to track key performance indicators (KPIs) and infrastructure costs at Houghton Mifflin Harcourt [15]. KPIs can give an insight into the performance of the business.
These use cases produce irregular time series, since they are based on events.
The volume may depend on various factors, such as the time (e.g. orders on a
Wednesday night compared to orders on Black Friday), the weather (e.g. umbrellas
sold in a convenience store), or other arbitrary factors (e.g. number of cars per
hour on a day with a train strike).
Physics experiment tracking
Time series databases have been used in physics experiments to capture and pro-
cess high volume data streams. For example, at CERN, the time series database
InfluxDB handles writes at a rate of over 700kHz [16].
Other use cases
Other use cases include game bot detection based on time series classification
[17], telecommunications forecasting based on usage pattern prediction and fraud
detection through pattern analysis.
3.2 A “good” benchmark
Chen et al. [18] consolidate the properties of a good benchmark, based on previous research, as follows:
• Representative: Benchmarks must simulate real-world conditions, both
the input to a system and the system itself should be representative and
relevant.
• Relevant: Benchmarks must measure relevant metrics and technologies.
Results should be useful to compare widely-used solutions.
• Portable: Benchmarks should provide a fair comparison by being easily
extensible to competing solutions that solve comparable problems.
• Scalable: Benchmarks must be able to measure performance in a wide range
of scale. Not just single-node performance, but also cluster configurations.
• Verifiable: Benchmarks should be repeatable and independently verifiable.
• Simple: Benchmarks must be easily understandable, avoiding complexity that does not affect performance.
These properties can be used to put existing benchmarks to the test. Relevance
of individual benchmarks will not be evaluated. All of these benchmarks evaluate
time series databases. Since TSDBs are the fastest growing type of database [19],
we consider all benchmarks relevant.
3.3 Existing benchmarks
Here, existing benchmarks for time series databases are examined in more detail
and properties described in Section 3.2 are discussed.
3.3.1 TS-Benchmark
TS-Benchmark is a benchmark simulating a wind plant monitoring system.
• ✓ Representative: TS-Benchmark uses a data model inspired by real-world applications. An ARIMA time series model is trained with real-world wind power data [6].
• ✓ Portable: TS-Benchmark targets InfluxDB, IoTDB, TimescaleDB, Druid and OpenTSDB.
• ✗ Scalable: Only single-node performance of database systems is tested. The benchmark could be extended to multi-node database systems.
• ✓ Verifiable: The source code for TS-Benchmark was published on GitHub.
• ✓ Simple: The benchmark follows a simple five-stage course, in which each stage performs a single operation or test.
3.3.2 IoTDB-benchmark
In a recent paper, for now only published on arXiv, Liu et al. describe IoTDB-benchmark, a benchmark specifically designed for time series databases [7].
• ✗ Representative: The data generator creates square waves, sine waves and sawtooth waves with optional noise. Furthermore, constant values and random values within a range can be generated. Care needs to be taken when selecting a data generation function: rarely will real-world data follow a perfect sine function. This will have an effect on the compaction of data. To ensure representativeness of data, the “random values within a range” function is the best approximation. However, depending on the use case, it will still not be representative of most real-world data, where subsequent data points may have a relatively low delta compared to other points close in time, instead of a completely random delta.
IoTDB-benchmark allows configuration of many data generation parameters, such as the data type of fields, number of tags per device, etc.
• ✓ Portable: IoTDB-Benchmark supports IoTDB, InfluxDB, OpenTSDB, KairosDB, TimescaleDB, and CTSDB. The focus is on IoTDB, and not all functions are supported in databases other than IoTDB. For example, generation and insertion of customized time series is currently only supported for IoTDB.
• ✗ Scalable: Only single-node performance of database systems is tested. The benchmark could be extended to multi-node database systems.
• ✓ Verifiable: The source code for IoTDB-Benchmark was published on GitHub.
• ✓ Simple: The benchmark follows a simple six-stage course, in which each stage performs a single operation or test.
3.3.3 TSDBBench/YCSB-TS
YCSB-TS, part of the TSDBBench benchmark, is a fork of YCSB that targets
time series databases, since these databases are not supported in YCSB.
• ✗ Representative: YCSB-TS allows configuration of the workload used. Selecting or creating a good workload is critical in ensuring that the benchmark is representative. The standard workload is artificial and not based on real-world data.
• ✓ Portable: YCSB-TS supports InfluxDB, KairosDB, Blueflood, Druid, NewTS, OpenTSDB and Rhombus.
• ✓ Scalable: YCSB-TS has support for benchmarking multi-node set-ups. Tests were performed with single-node set-ups and five-node set-ups [5].
• ✓ Verifiable: The source code for all components of TSDBBench was published on GitHub, along with instructions on how to replicate the benchmark.
• ✗ Simple:
3.3.4 FinTime
FinTime is an older benchmark (it was proposed in 1999), but it still holds value
as a representative benchmark. It mimics financial industry use cases.
• ✓ Representative: FinTime’s two models are based on real-world financial use cases. Namely, it specifies data generation and queries for historical financial market information and for a tick database for financial instruments.
• ✓ Portable: FinTime does not prescribe a query language. Implementations have been created for SQL databases, but SQL is not required.
• ✗ Scalable: The benchmark was performed on single-node database systems, but could be extended to work on multi-node systems.
• ✗ Verifiable: Only the source code for the data generation was published. It is unclear how latency and throughput are measured.
• ✗ Simple: Since FinTime is only a description of a data schema and queries to be run, it requires manual implementation.
3.3.5 influxdb-comparisons
The influxdb-comparisons project is a benchmark created by InfluxData, vendor
of InfluxDB.
• ✓ Representative: The influxdb-comparisons benchmark simulates a DevOps use case, in which many different hosts send usage statistics (such as CPU load, disk I/O usage, etc.) to a time series database. This is a representative benchmark for this scenario.
• ✓ Portable: The benchmark currently supports seven different TSDBs.
• ✗ Scalable: Only single-node performance is tested. The benchmark could be extended to multi-node database systems.
• ✓ Verifiable: The source code for influxdb-comparisons is available under the MIT licence on GitHub.
• ✓ Simple: The benchmark follows a five-stage course, in which each stage performs a single operation or test.
3.4 Evaluation of existing benchmarks
Table 3.1 shows the compiled evaluation of existing benchmarks.
Benchmark              Representative            Relevant  Portable  Scalable  Verifiable  Simple
TS-Benchmark           For IoT use cases         ✓         ✓         ✗         ✓           ✓
IoTDB-benchmark        ✗                         ✓         ✓         ✗         ✓           ✓
TSDBBench              ✗                         ✓         ✓         ✓         ✓           ✗
FinTime                For financial use cases   ✓         ✓         ✗         ✗           ✗
influxdb-comparisons   For DevOps use cases      ✓         ✓         ✗         ✓           ✓

Table 3.1: Evaluation of existing TSDB benchmarks
3.4.1 On scalability
Scalability is a gap in the current state of the art. Only one benchmark, TSDBBench, tests multi-node performance. Testing multi-node set-ups is often harder, due to either lengthy manual or error-prone automated test set-up provisioning.
When TSDBs are actually deployed in the real world, multi-node set-ups are the norm. Benchmarks should reflect this. Supporting multi-node set-ups in a benchmark is usually not hard, but configuring, setting up, and comparing these set-ups takes a lot of time.
Most benchmarks are able to test multi-node set-ups, because most distributed TSDBs present a single interface: the client application does not need to be aware of the clustered nature of the TSDB.
3.4.2 On representativeness
As mentioned in Section 3.2, representativeness means that benchmarks must simulate real-world conditions, both in the input to the system and in the system itself. For the system itself, this means no configuration tuning that would not be used in real production systems, running benchmarks on system configurations that reflect those on which production databases would run, etc. For the input to the system, it means real-world data and real-world queries, or data and queries comparable to real-world usage. Representativeness is important for generalisation purposes: we cannot generalise the results of a benchmark to real-world usage if the benchmark is not representative of real-world usage.
TS-Benchmark, FinTime and influxdb-comparisons seem to be representative benchmarks, but only for specific domains. The results of FinTime are only valid in financial contexts, and those of influxdb-comparisons only in specific DevOps contexts. This leads to false generalisations: we cannot draw conclusions on the performance of a database as a whole when a benchmark simulating a single use case is used.
Tay [20] and Zhang et al. [21] have made the case for application-specific bench-
marking: instead of using generic micro-benchmarks, real world data are either
used directly to benchmark a system or used to construct a representative bench-
mark.
Since the use cases of time series databases are broad, it is necessary to develop
benchmarks that test a variety of representative scenarios. At the moment, no
such benchmarks exist.
3.5 Contribution
This dissertation discusses the design, technical implementation and results of a
representative benchmark. It compares three representative workloads to a base-
line. The representative workloads use existing real world time series data sets
and are chosen to simulate environments and use cases in which TSDBs are often used.
Evaluation of the results of the benchmark will determine whether representative benchmarks are a necessity, or whether non-representative benchmarks accurately predict performance for representative workloads. If non-representative benchmarks can predict real-world performance, then representative workloads are not needed, which may lead to simpler benchmarks. If non-representative benchmarks cannot accurately predict real-world performance, the validity of non-representative benchmarks can be called into question.
Chapter 4
A new benchmark
In the previous chapter, current benchmarks have been examined, and their insuf-
ficient representativeness has been noted. This may present a problem for general-
isation of their results: do they accurately model real world performance? In this
chapter, a new benchmark will be described, with a focus on representativeness.
This benchmark will be used to test both representative and non-representative
workloads to examine differences in performance.
4.1 Benchmark components
A benchmark consists of multiple separate components. The workload data set characteristics are the time series data characteristics described in Section 3.1.4. Workload query characteristics comprise the characteristics of the queries themselves and the spread between query types. Finally, the metrics measurement component will be considered.
4.1.1 Workload data set characteristics
Apart from the time series data characteristics discussed in Section 3.1.4, time series data sets can be categorised as synthetic or real-world, and as having a high or low existing data volume.
Synthetic workload data sets are produced by tunable synthesizers [22]. These workloads may trade representativeness for configurability, and care should be taken in their configuration. Real world data will be used as the workload for this benchmark.
High existing data volumes may influence database performance. For big data
sets, a DBMS may need to scan large amounts of data.
4.1.2 Workload query characteristics
In Section 3.1.2, the functions of TSDBs were defined. These lead to possible
queries, such as reading single data points, averaging data points values within a
time range, and summation of all data point values with a certain tag.
Not only the types of queries are important, but also the relative frequency of each query type compared to all query types. For example, an application may frequently insert new data, while calculating the maximum data point value is done infrequently.
Concurrency may play an important role when benchmarking queries. When mul-
tiple queries are run, performance may degrade, especially when read and write
queries are mixed. In this dissertation, mixed read and write queries are not con-
sidered. Write queries will be considered in an ingestion benchmark, and read
queries will be considered in load testing and latency testing benchmarks.
4.1.3 Measurement characteristics
The last benchmark factor is the measurement component. This component measures the effective performance of the operations performed. The metrics surveyed may be latencies, network usage, storage requirements, etc. Care must be taken that the measurement component minimally influences the benchmark results. For example, an ingestion client could monitor the number of data points per second sent to the database: this requires no instrumentation on the database server and thus minimally disturbs it.
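Such a client-side ingestion-rate measurement might look as follows; write_batch is a hypothetical stand-in for a database client call, not the API of any specific TSDB:

```python
import time

def measure_ingestion_rate(batches, write_batch):
    """Send batches of data points to the database and report
    client-side throughput (data points per second). No server
    instrumentation is needed, so the database is minimally disturbed."""
    total_points = 0
    start = time.perf_counter()
    for batch in batches:
        write_batch(batch)  # hypothetical database client call
        total_points += len(batch)
    elapsed = time.perf_counter() - start
    return total_points / elapsed

# Example with a no-op writer standing in for a real client:
batches = [[(i, float(i)) for i in range(1000)] for _ in range(10)]
rate = measure_ingestion_rate(batches, write_batch=lambda b: None)
print(f"{rate:.0f} points/s")
```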
4.2 Design of a representative data workload
In Section 3.4.2, it was argued that representativeness is dependent on industry
and use cases. Therefore, as the workload data set for the time series database
benchmark, four different data sets will be considered. These are selected to be in
different domains, with different time series characteristics. To ensure representa-
tiveness, data sets with real data are used. These are selected to model real world
use cases for time series databases.
Of course, four different data sets do not cover every industry or use case. How-
ever, analysis of the results of benchmarks using these workload data sets will
allow comparisons that indicate if the considered use case has an influence on
performance.
4.2.1 A baseline workload
This is a non-representative workload, to be used as a baseline for comparison
with representative workloads. Data points are written to one metric with random
values and random tags.
• Metrics: Only one metric is tracked: “benchmark”. All data points belong
to this metric.
• Regularity: The time series is fully regular, with one data point being
produced every second.
• Volume: Low volume. There is only one metric where a data point is
produced every second. There are no spikes of traffic.
• Data type: For this benchmark, floating point numbers will be used to
represent the data point values.
• Tags: Every data point is tagged with two random tags. The possible values of the first tag are TAG_1_00 to TAG_1_99 and the possible values of the second tag are TAG_2_00 to TAG_2_99.
• Tag value cardinality: High. There are 10,000 (two tags with 100 possible
values each) possible tag combinations.
• Variation: High. The values are randomly generated for every data point.
Data point values bear no relationship to previous values. The values are
floating point numbers between 0 and 100 inclusive.
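As a sketch, the baseline workload described above could be generated as follows. The tuple layout, function name and tag formatting are illustrative assumptions, not the benchmark's actual implementation:

```python
import random

def baseline_points(n, seed=42):
    """Yield (timestamp, metric, value, tags) tuples: one point per
    second for the single "benchmark" metric, with a random float
    value in [0, 100] and two random tags with 100 values each."""
    rng = random.Random(seed)
    for ts in range(n):
        tags = {
            "tag1": f"TAG_1_{rng.randrange(100):02d}",
            "tag2": f"TAG_2_{rng.randrange(100):02d}",
        }
        yield ts, "benchmark", rng.uniform(0.0, 100.0), tags

for point in baseline_points(3):
    print(point)
```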
4.2.2 A financial time series workload
Time series data are often used in financial analysis. Prices of commodities, fu-
tures, assets, and other financial instruments produce time series [23]. This his-
torical data can then be used in performance calculations, price prediction, and
financial ratio calculation.
The data set used for this benchmark was created by Boris Marjanovic and published on Kaggle [24]. It is licenced under CC0¹. The data set contains historical data for 1344 Exchange-Traded Funds (ETFs) and 7195 stocks. For each stock and ETF, it lists the open, high, low and closing prices, next to the volume² and open interest³ for every day the ETF or stock was trading.
• Metrics: Six different metrics are tracked: the opening, high, low and closing prices, and the volume and open interest for the stock.
• Regularity: Semi-regular. Every day, an update is published, except on
weekends and market closings (such as holidays). It is rare for new stocks to
be published or for existing stocks to be removed from the exchange.
• Volume: Low volume, with short bursts. Data are published at market
closing, which is the same time every day. This may lead to short spikes of
high traffic when a lot of stocks are tracked.
• Data type: Prices are represented by numbers with five digits past the decimal point. Floating point numbers are sometimes not used to store these prices, due to the possible inaccuracies and high cost of processing floating point operations. Instead, the prices are multiplied by 10⁵ and saved as integers. This does place a burden on client applications if the database does not perform the conversion itself; therefore, prices will be saved as floating point numbers for this benchmark.

¹ Creative Commons 1.0 Public Domain Dedication, which dedicates this work to the public domain.
² The total number of shares traded during a day.
³ The number of outstanding contracts that have not been fulfilled.
• Tags: Only a single tag is saved: the ticker symbol. Ticker symbols are strings for which no general format is specified: every exchange specifies its own rules. In general, symbols are short (nine is the maximum length in this data set), alphanumeric (and containing no numbers in this data set) and case-insensitive. As an example, Apple’s stock ticker symbol is AAPL.
• Tag value cardinality: Medium. There are 7,164 possible tag values. For
the first one million data points, the tag cardinality is 143.
• Variation: Low. While stock prices are volatile, it is rare for stocks to change greatly within the span of a day.
4.2.3 A rating system workload
Rating systems allow customers and consumers to rate their experiences of goods
and services. Users can like or dislike products, leave comments about a restaurant
visit, or leave a rating for sellers on online marketplaces. Commonly, this feedback
is represented as a five-star system, where half a star represents the lowest score,
and five stars represents the maximum score.
GroupLens Research created data sets of varying sizes from the MovieLens website,
which allows users to rate movies with a five-star system [25]. The MovieLens 20M
data set contains twenty million ratings and is the basis for this workload. The
data set comes with a custom license, allowing non-commercial use, but forbidding
redistribution.
• Metrics: Only one metric is tracked: ratings. The value of the data point
is the rating the user gave a movie, and the timestamp is when this rating
was published.
• Regularity: The time series is irregular. The data points are events, pro-
duced when a user leaves a review.
• Volume: Approximately one review was left every thirty seconds. This is not a high level of activity; we can therefore qualify this time series as low volume.
• Data type: The ratings are floating point numbers, between 0.5 and 5.0 in
0.5 increments. This leads to ten possible values.
• Tags: Five tags are associated with every data point: userId (integer, the identifier of the user who left the review), title (string, the title of the movie being reviewed), imdbId (integer, the identifier of the movie on the Internet Movie Database⁴), tmdbId (integer, the identifier of the movie on The Movie Database⁵), and genres (string, a list of genres the movie belongs to, encoded as a string).
• Tag value cardinality: High. There are 138,493 different users and 26,212 different movie titles. The remaining tags are dependent on the movie title (the title directly implies the genres and external identifiers). Since not every user has rated every movie, the tag cardinality is not the product of these two figures. The tag cardinality of the complete data set was determined to be 20,000,262, and that of the first one million data points 1,000,000.
• Variation: Subsequent points do not relate to each other, since they are
ordered by timestamp and not the movie reviewed. This leads to a high
variation. However, the absolute variation is still small, since the maximum
absolute variation is 4.5.
4.2.4 An IoT workload
IoT applications, in particular sensor applications, produce a lot of data. This can be temperature data, power consumption, location data, etc. IoT data are almost always temporally indexed, thus a time series database is a natural fit.

⁴ https://www.imdb.com/
⁵ https://www.themoviedb.org/
The UCI⁶ Machine Learning Repository [10] contains the Individual household electric power consumption Data Set, a data set which records power information for a house every minute. It was created by Georges Hebrail and Alice Berard and released under the CC BY 4.0⁷ license.
• Metrics: Seven metrics are tracked for the household: active and reac-
tive power, voltage, intensity (current), and three power meters for different
rooms.
• Regularity: The data set is regular. Every minute, a new data point is
emitted. Data are missing for a small period of time, and for these missing
data points, the values were filled in with zeroes.
• Volume: Only seven data points are emitted every minute. This makes the
data set low volume.
• Tags: The data contains no tags.
• Variation: Variation between subsequent data point values is low due to
the small sampling interval.
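The zero-filling of missing minutes described above can be sketched as follows. This is a minimal illustration, not the actual pre-processing code; the dictionary-based representation and the default metric count are assumptions.

```python
from datetime import datetime, timedelta

def fill_gaps(points, start, end, n_metrics=7):
    """Emit one data point per minute between start and end (inclusive);
    minutes missing from `points` are filled with zero values, as was
    done for the gaps in the household power data set."""
    filled = []
    step = timedelta(minutes=1)
    t = start
    while t <= end:
        filled.append((t, points.get(t, [0.0] * n_metrics)))
        t += step
    return filled

# A minute absent from `points` comes back as its timestamp plus zeroes.
start = datetime(2006, 12, 16, 17, 24)
points = {start: [1.2], start + timedelta(minutes=2): [1.3]}
series = fill_gaps(points, start, start + timedelta(minutes=2), n_metrics=1)
```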
4.2.5 Workload data set overview
Table 4.1 shows an overview of all used workload data sets.
4.2.6 Data set pre-processing
The data sets were pre-processed using Python. All data sets are denormalized
so as to provide one data point per line in the resulting file. Every line provides a
complete data point, including the timestamp, metric name, data point value, and
(potentially) tags.
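As an illustration, the denormalization step for a ratings-style CSV might look like the following sketch. The column names (userId, title, timestamp, rating) and the metric name are assumptions for the example, not the actual pre-processing code.

```python
import csv

def format_point(timestamp, metric, value, tags):
    """One complete data point per line: timestamp, metric name,
    value, and a comma-separated, sorted tag list."""
    tag_str = ",".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"{timestamp} {metric} {value} {tag_str}".rstrip()

def denormalize(csv_path, out_path):
    """Denormalize a ratings-style CSV into one data point per line."""
    with open(csv_path, newline="") as src, open(out_path, "w") as out:
        for row in csv.DictReader(src):
            tags = {"userId": row["userId"], "title": row["title"]}
            out.write(format_point(row["timestamp"], "rating",
                                   row["rating"], tags) + "\n")
```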
6 University of California, Irvine
7 Creative Commons Attribution 4.0 International
                       Baseline     Financial     Rating       IoT
Metrics                1            6             1            7
Regularity             Regular      Semi-regular  Irregular    Regular
Volume                 Low volume   Low volume    Low volume   Low volume
Tags                   2            1             5            0
Tag value cardinality  10,000       7,164         20,000,262   0
Variation              High         Low           High         Low
Total data points      20,000,000   74,418,459    20,000,262   14,526,812
License                -            CC0           Custom       CC BY 4.0

Table 4.1: Overview of workload data sets
4.3 Design of a representative query workload
A representative data set is only part of the workload. Representative queries on
these data sets are the other. While real world data sets are readily available, in-
formation on data usage or queries performed on these data sets is not. Therefore,
for every data set, logical queries and patterns will be created. For a truly rep-
resentative query workload, existing TSDB systems should be surveyed and their
usage patterns monitored.
The implementation of query workloads was complicated by the fact that every
database uses a custom query language. These may have different semantics. For
example, when grouping a time range by week, some TSDBs will start the grouping
block on the start timestamp of the given range, while others will align the groups
to the calendar (so the first block may not be a full week). Query results have to
be compared to ensure correctness. A standardized query language, as SQL is for
RDBMSs, would speed up development of benchmarks and TSDB applications.
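The grouping-alignment difference described above can be made concrete with a small sketch: the same data point falls into differently-bounded weekly blocks depending on whether buckets are aligned to the start of the queried range or to a fixed calendar origin (the Unix epoch is used here for simplicity; actual TSDBs may align to calendar weeks).

```python
WEEK = 7 * 24 * 3600  # one week in seconds

def bucket_range_aligned(ts, range_start):
    """Weekly blocks start at the start timestamp of the queried range."""
    return range_start + ((ts - range_start) // WEEK) * WEEK

def bucket_calendar_aligned(ts, origin=0):
    """Weekly blocks are aligned to a fixed origin (here the Unix epoch),
    so the first block of a queried range may not be a full week."""
    return origin + ((ts - origin) // WEEK) * WEEK

# A point three days into a range starting at t = 100,000 s lands in a
# block starting at 100,000 under range alignment, but at 0 under
# epoch alignment, so per-block aggregates differ between the two.
range_start = 100_000
ts = range_start + 3 * 24 * 3600
```

This is why query results have to be compared across TSDBs before benchmark numbers can be trusted.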
4.3.1 Queries for the baseline workload
The baseline workload is a non-representative workload to which others will be
compared. The query workload reflects this: there is only one query, requesting a
single data point between two timestamps with two specific tags.
4.3.2 Queries for the financial workload
The financial query workload simulates a stock information application which in-
forms stock traders of historical statistics. The following queries are run:
• Get all opening prices for a stock in a time range (relative frequency: 0.20)
• Get the minimum closing price for a stock (relative frequency: 0.25)
• Get the maximum opening price for a stock (relative frequency: 0.15)
• Get the mean high price for a stock grouped by week (relative frequency:
0.25)
• Get the total volume for a stock grouped by four weeks (relative frequency:
0.15)
4.3.3 Queries for the rating workload
The ratings query workload simulates the backing database of a movie website.
The queries get the average rating for a movie by title or IMDb identifier, get
ratings for a particular user, and group average ratings for a movie by year:
• Get the mean rating for a movie with a specific title (relative frequency:
0.70)
• Get the mean rating for a movie with a specific IMDb identifier (relative
frequency: 0.10)
• Get all ratings by a specific user (relative frequency: 0.05)
• Get mean rating per year for a movie with a specific title (relative frequency:
0.15)
4.3.4 Queries for the IoT workload
The IoT query workload mimics a power consumption monitoring application.
Mean active power is requested for a one week range, for a two week range
grouped by day, and for a twelve week range grouped by week:
• Get mean active power for a one week time range (relative frequency: 0.4)
• Get mean active power for a two week time range grouped by day (relative
frequency: 0.4)
• Get mean active power for a twelve week time range grouped by week (rela-
tive frequency: 0.2)
4.4 Metrics
Ingestion throughput is the number of data points per second that can be
inserted into the database, possibly using a bulk loading mechanism. This metric
is especially important for OLAP applications, where data from a master database
is loaded into a TSDB for time series analytics processing.
Space consumption is the amount of storage required to store the database.
Storage efficiency is space consumption divided by the number of data points
stored. This metric shows how efficient the database engine is at compressing data
points. The measurement is taken after loading the database with a predefined
set of data points, and is expressed in bytes per data point.
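As a worked example, storage efficiency is simply space consumption divided by the number of stored points (the 57 MB figure below is a hypothetical directory size, not a measurement):

```python
def storage_efficiency(data_dir_bytes, n_points):
    """Space consumption divided by the number of stored data points,
    expressed in bytes per data point."""
    return data_dir_bytes / n_points

# A hypothetical 57 MB data directory holding one million points:
eff = storage_efficiency(57_000_000, 1_000_000)  # 57.0 bytes per point
```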
Latency, expressed in mean, 95th, and 99th percentile response times, shows how
fast the database can answer queries. For user-facing applications, this is especially
important: applications need to render quickly, or users leave.
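A sketch of how such percentiles can be computed from raw latency samples, using the nearest-rank method (the actual reporting in this benchmark is done by the measurement tools; the samples below are hypothetical):

```python
def percentile(latencies_ms, p):
    """Nearest-rank percentile: the smallest observed latency such that
    at least p percent of requests completed within it."""
    s = sorted(latencies_ms)
    rank = -(-p * len(s) // 100)  # ceil(p * n / 100) without floats
    return s[max(rank - 1, 0)]

# With these hypothetical samples, one in twenty requests is slower
# than the 95th percentile value.
samples = [12, 15, 11, 90, 14, 13, 200, 16, 12, 13]
p95 = percentile(samples, 95)
```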
Load testing gives us the maximum number of requests per second a TSDB
can handle.
The mean response size is the average size in bytes of the returned TSDB
response body. In addition to the requested data, this response body may contain
metadata, such as the number of data points used in the calculation, aggregated
tags, etc. While this information may be useful to some applications, a small
TSDB response is generally preferred: it leads to faster responses, lower network
load and lower memory requirements, though the effects may be small.
4.5 Technical implementation
4.5.1 Test environment
The tests were run on homogeneous machines containing two quad-core Intel
E5520 (2.2GHz) CPUs, 12GB RAM and a 160GB hard disk. The devices were
connected via Gigabit Ethernet.
The versions of the databases used are as follows: OpenTSDB 2.3.1 (with HBase
1.4.4), InfluxDB 1.5.4, and KairosDB 1.2.1 with either ScyllaDB 3.0.6 or Cassandra
3.11. These databases were minimally changed from their stock configuration:
for InfluxDB, the maximum number of series was increased; for OpenTSDB,
chunked requests were enabled; and for KairosDB (with ScyllaDB), the maximum
batch size was decreased to one hundred for the financial workload.
The databases were run in Docker containers: a single container each for
OpenTSDB and InfluxDB, and two containers (in a docker-compose setup) for
KairosDB with either underlying DBMS. When not under test, containers were
stopped. Only one container was under test at a given time, and no other
applications were active on the database host apart from basic monitoring
software.
During tests, one machine acted as the database host, while the other loaded the
data or performed queries.
4.5.2 Data ingestion
Data loaders from the influxdb-comparisons project [26] were used. These load
the data sets, converted for use with a specific database, into that specific database.
Since no data loader was available for KairosDB, its Telnet API was used.
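A minimal sketch of ingesting points over the KairosDB Telnet API is shown below. The host, port, and the example point are placeholders; the `put` line format (metric, millisecond timestamp, value, tag=value pairs) follows the KairosDB telnet protocol, but this is an illustration rather than the loader actually used.

```python
import socket

def format_put(metric, ts_ms, value, tags):
    """Build one telnet `put` line: metric, millisecond timestamp,
    value, and space-separated tag=value pairs."""
    tag_str = " ".join(f"{k}={v}" for k, v in sorted(tags.items()))
    return f"put {metric} {ts_ms} {value} {tag_str}\n"

def send_points(host, port, points):
    """Stream (metric, ts_ms, value, tags) tuples to a KairosDB
    telnet endpoint (host and port are placeholders)."""
    with socket.create_connection((host, port)) as conn:
        for point in points:
            conn.sendall(format_put(*point).encode("ascii"))

# e.g.: send_points("localhost", 4242,
#                   [("power", 1189641600000, 1.2, {"house": "1"})])
```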
4.5.3 Load and latency testing
Vegeta [27], a load testing tool, is used to test latencies. Every second, a data
set-specific number of requests is made to the TSDB. There are twenty queries in
every query workload, and each one is translated to the query language of every
TSDB. The queries are cycled in a round-robin pattern so as to ensure determinism.
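One way to build such a deterministic twenty-query cycle is to expand the relative frequencies of Section 4.3 into a fixed-length list (a sketch; the query names are shorthand labels, not actual query strings):

```python
def build_cycle(queries, cycle_len=20):
    """Expand (query, relative frequency) pairs into a fixed-length
    list that is then cycled round-robin, so every benchmark run
    issues the exact same request sequence."""
    cycle = []
    for query, freq in queries:
        cycle.extend([query] * round(freq * cycle_len))
    assert len(cycle) == cycle_len, "frequencies must fill the cycle"
    return cycle

# The financial workload of Section 4.3.2 (shorthand names):
financial = [
    ("opening prices in range", 0.20),
    ("minimum closing price", 0.25),
    ("maximum opening price", 0.15),
    ("mean high price by week", 0.25),
    ("total volume by four weeks", 0.15),
]
cycle = build_cycle(financial)  # 4 + 5 + 3 + 5 + 3 = 20 queries
```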
http_load [28] is used to conduct load testing. The program is configured with
a thirty second timeout, ten parallel requests and a thirty second run time. The
same URLs as in the latency tests are loaded; when one request finishes, another
one starts. Afterwards, the number of requests per second is reported.
4.6 Design evaluation
In Section 3.2, properties of a good benchmark were discussed. Now, these will be
applied to the benchmark described in this chapter.
• Representative: Through the use of multiple use cases, real world, non-
synthetic data sets, and balanced query workloads, this is a very representa-
tive benchmark.
• Relevant: This benchmark evaluates TSDBs. As the fastest growing type
of database [19], this can be considered a relevant benchmark. The met-
rics measured are based on other database and TSDB benchmarks, and are
comparable with them.
• Portable: To add a new database to the benchmark, the following com-
ponents are necessary: a Docker container containing the database, a data
formatter, a data ingestion loader, and a set of queries as HTTP requests.
Most open source databases have existing Docker containers, and creating a
data formatter is a few hours of work. A data ingestion loader is more time-
consuming, but many databases have existing ingestion loaders. The last
component presents a challenge: not every database has an HTTP interface.
For example, TSDBs that rely on an existing RDBMS, such as Timescale
(built on PostgreSQL), do not include an HTTP API. To benchmark this
kind of database, the benchmark would need to be extended to include
other measurement tools. This does make comparison of results harder.
• Scalable: Both the ingestion and querying component of the benchmark
are able to accept a list of different URLs to spread the load. This makes
the ingestion and measuring component of the benchmark scalable. However,
tests were only conducted on single-node TSDB setups. Multi-node database
setups are hard to set up correctly, and it is especially hard to fairly compare
heterogeneous DBMSs, such as TSDBs.
• Verifiable: The data sets used are available under open licenses (Section
4.2.5), the tools to ingest are available under the MIT license [26], and the
tool to test latencies and response size is available under the MIT license.
The components to denormalize the data sets, to transform them to specific
database formats, and the database setup components will be made available
as open source when the embargo on this master’s dissertation ends.
• Simple: The benchmark was kept as simple as possible, with distinct parts
doing a single thing. This leads to an architecture where one component can
easily be swapped with another, e.g. the data generator could be switched
with a generator from another benchmark.
Chapter 5
Results
The results of the data ingestion of the data set workloads described in Section
4.2 and the query workload upon those data sets described in Section 4.3 are
presented here. The metrics reported are described in Section 4.4. The results are
analysed to examine possible performance differences between non-representative
and representative workloads.
5.1 Storage efficiency
One million data points were inserted in TSDBs. Afterwards, the database was
shut down, and the size of the data directory of every TSDB was measured. This
includes raw database files and write-ahead logs. As a comparison, the storage
efficiency of CSV files is included. Figure 5.1 shows the results graphically.
InfluxDB performs nearly as well as CSV for the baseline data set workload, but
is much less efficient for the representative data set workloads.
OpenTSDB and KairosDB require at least one tag to be present on data points.
Therefore, the tag notags with the string value "true" was added to the IoT data
set, which otherwise contains no tags. This may influence storage consumption,
but both TSDBs perform very well for this data set workload nonetheless.
OpenTSDB outperforms every other TSDB for all representative data set work-
Figure 5.1: Storage efficiency of different TSDBs in bytes per data point. Data points
contain a timestamp, a value, and may contain tags, depending on the data
set.
Figure 5.2: Relative storage efficiency of different TSDBs per data point compared to
the CSV source format.
loads. It shows exceptional performance for the IoT data set, where it is able to
store data nearly four times as efficiently as the CSV input data set. This is likely
a result of the low tag value cardinality: there is only one tag and one tag value^1.
KairosDB (with Cassandra) performs well for the IoT data set workload, but does
not do better than the CSV source data set. It always uses at least twice as much
storage space as the source data set.
KairosDB (with ScyllaDB) was unable to complete data loading for the ratings
data set. For the other data sets, it used exactly 1,074.73 bytes per data point
to store all data, regardless of the data size. To ensure these three measurements
were correct, they were repeated, and the same values were found. The fact that
the persisted data size is so large is remarkable, since ScyllaDB uses the same
storage format as Cassandra [29].
When comparing the relative storage efficiency compared to CSV (graphically
displayed in Figure 5.2), the impact of high tag value cardinality becomes clear.
Tag value cardinality is the number of possible combinations tag values can make.
InfluxDB in particular requires relatively more storage space for higher tag value
cardinality in the representative data set workloads. Other TSDBs display no
such dependency on tag value cardinality. Variation may also play an important
role. The representative workloads have lower data point value variation than
the baseline, especially the IoT data set. This may enable OpenTSDB to more
efficiently store the time series.
It is clear that representative data set workloads reveal patterns not uncovered
by a traditional data set workload. A non-representative benchmark might
appoint InfluxDB the winner of a storage efficiency test, while it is clear that, on
the given representative domains, OpenTSDB has much better storage efficiency.
1 The original data set does not contain any tags, but since OpenTSDB requires at least
one tag for data points, the tag notags with the string value "true" was used.
5.2 Data ingestion throughput
The data ingestion throughput or data ingestion rate is the number of data points
a TSDB can ingest per second in a bulk loading pattern. Ingestion rate tests were
performed with data sets with one million data points and the results are shown
in Figure 5.3.
InfluxDB outperforms the other TSDBs for all data set workload ingestion tests
except the ratings data set workload, where its ingestion is seven times slower
than KairosDB (with Cassandra) and nearly five times slower than OpenTSDB.
This is likely due to the high series cardinality.
OpenTSDB performs better than KairosDB, but is still significantly slower than
InfluxDB. For the non-representative baseline data set workload, OpenTSDB is
nearly five times slower than InfluxDB. For representative data set workloads, this
gap shrinks. InfluxDB is just over twice as fast as OpenTSDB for the IoT and
financial data set workloads. For the ratings data set workload ingestion test,
OpenTSDB is nearly five times as fast as InfluxDB.
KairosDB (with ScyllaDB) was unable to complete for the ratings data set work-
load. The ingestion speed was 33,340 data points per second, but since not all
data points were successfully saved, this result is excluded.
The differences between KairosDB with Cassandra and KairosDB with ScyllaDB
are not huge, but ScyllaDB consistently outperforms Cassandra. For the baseline
data set workload, ScyllaDB performs 8.10% better; for the IoT and financial
workloads, 12.95% and 5.55% respectively.
For the IoT and financial data set workloads, relative performance is comparable
to the baseline. InfluxDB comes in first, OpenTSDB second, followed by KairosDB
with ScyllaDB and KairosDB with Cassandra, respectively. However, for the rat-
ings data set workload, we see a different pattern. Here, InfluxDB has the slowest
ingestion speed, and KairosDB with Cassandra the highest, with OpenTSDB in the
middle. The reason for this is unclear. High tag value cardinality has been known
to slow down InfluxDB performance through high memory usage, but InfluxDB
performed well on the baseline, which also has high tag value cardinality. This
performance may be caused by the large size of the data points and the large
number of tags.
The use of real world, representative data sets revealed a performance degradation
of InfluxDB compared to the other TSDBs for the ratings data set.
5.3 Load testing with query workload
The maximum number of queries per second was determined for every TSDB-data
set tuple. The results are shown in Figure 5.4.
InfluxDB significantly outperforms all other TSDBs for every query workload. In
the non-representative query workload, it outperforms the next runner-up
(OpenTSDB) by a factor of 18. In the representative query workloads, this factor
is different. For the IoT query workload, InfluxDB performs 8.5 times better than
OpenTSDB, for the financial query workload 15 times better, and for the ratings
query workload nearly 37 times better. Clearly, the query workload has a big
impact on performance.
KairosDB with Cassandra was not able to complete the ratings workload due
to memory constraints. KairosDB with ScyllaDB was not able to complete this
query workload because not all the data could be loaded (see Section 5.2). For
the other query workloads, ScyllaDB outperforms Cassandra every time. In the
baseline query workload, it outperforms by just over 20%. For the representative
IoT query workload, it achieves 36.49% more requests per second, and for the
financial query workload a 22% improvement.
It is remarkable how much better KairosDB performs on the representative work-
loads. The baseline workload requests just one data point, a very simple query
which is easily cacheable, and yet KairosDB performs two times better on the more
representative, but much more complex, IoT and financial benchmarks. There is
no clear explanation why KairosDB would perform so much worse for a
Figure 5.3: Data points ingested per second. Data sets used were one million data
points each.
much simpler workload. If anything, it would be expected to perform much faster
than on the IoT workload, since the baseline only requests a single data point
(which can be cached) and requires no aggregation or calculations.
OpenTSDB performance is good in the baseline and the IoT query workload, but
is degraded in the financial and ratings query workload. This may have to do
with the fact that the data ranges to scan are much bigger in these last two query
workloads, while the first two only require data from relatively narrow time ranges.
5.4 Response latency
The mean latency, shown graphically in Figure 5.5, is the mean time it takes to
receive a response from the TSDB. The 95th percentile response time is displayed
graphically in Figure 5.6. This metric gives the maximum latency for 95% of
requests; one in twenty requests will have a longer latency than this.
Figure 5.4: Maximum requests per second. Tests were performed on data sets one
million data points in size.
The tests were performed with a constant rate of requests. This rate was de-
termined by choosing the lowest maximum requests per second for every query
workload: empirically, the request rate was increased until timeouts were ob-
served, and this request rate was then rounded down. It was found that some
TSDBs could handle more requests per second than the load testing indicated
when the number of parallel requests was increased, with little increase in latency.
Ultimately, the rates used were rounded to 10 requests per second for the baseline
query workload, 20 requests per second for the IoT query workload, 30 requests
per second for the financial query workload, and 2 requests per second for the
ratings query workload.
KairosDB with Cassandra was not able to complete the ratings workload due to
memory constraints, and KairosDB with ScyllaDB because not all data could be
loaded (see Sections 5.2 and 5.3).
InfluxDB is the clear winner when it comes to latency. The TSDB is able to
handle requests and send a response in less than 2ms for the baseline, and queries
for the complex ratings query workload take on average just over 100ms. InfluxDB
outperforms all other TSDBs tested when it comes to latency, both mean latency
and 95th percentile.
OpenTSDB shows good performance for the baseline and IoT query workloads, but
like the load testing, has trouble with the financial and ratings query workloads. As
mentioned in Section 5.3, this may have to do with the big time ranges the TSDB
has to scan to aggregate data points. The latencies for the last two workloads are
high: the average latency is over two and a half seconds.
KairosDB with ScyllaDB shows greater performance than KairosDB with Cassan-
dra for every query workload. For the first two workloads, it performs nearly twice
as fast when comparing mean latency. For the financial workload, the difference
(ScyllaDB 4.63% faster) is small.
5.5 Mean response size
In Figure 5.7, the mean response size is shown graphically. This mean is clearly
coupled to the data set. Overall, InfluxDB has the most verbose responses.
After inspecting a few responses, the main reason for this seems to be that
InfluxDB encodes timestamps as strings in responses, while KairosDB uses
numbers, and OpenTSDB uses numbers encoded as strings. Compare these
encodings:
• KairosDB encodes the time as 1189641600000, representing the number of
milliseconds since January 1, 1970. This takes 13 bytes to encode in JSON.
• InfluxDB encodes the time as "2007-09-13T00:00:00Z", which takes 22
bytes to encode. However, this format can express more precision, down to
fractional seconds and even nanoseconds.
• OpenTSDB encodes the time as "1189641600", representing the number of
seconds since January 1, 1970, as a string. This takes 12 bytes to encode in
JSON, but is not as precise as the other encodings.
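These token lengths can be checked directly with a JSON encoder. Note that these are raw encoded-value lengths; a full response body adds field names and separators around each token.

```python
import json

# Raw JSON token lengths of the three timestamp encodings:
kairos   = len(json.dumps(1189641600000))           # number, 13 bytes
influx   = len(json.dumps("2007-09-13T00:00:00Z"))  # string, 22 bytes
opentsdb = len(json.dumps("1189641600"))            # string, 12 bytes
```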
Figure 5.5: Mean latency per request.
Figure 5.6: 95th percentile of latency per request.
Other factors influence the response size. For example, OpenTSDB and KairosDB
will return a list of tags used on data points. For large responses, such as those of
the representative query workloads, the timestamp encoding is the deciding factor.
Both KairosDB TSDBs experienced timeouts for the baseline data set. For
KairosDB on Cassandra, 27 timeouts were encountered, and KairosDB on Scyl-
laDB encountered 4 timeouts. These were ignored when calculating the mean
response size.
KairosDB on Cassandra and on ScyllaDB both return the same number of bytes
for the IoT and financial workloads, since the underlying databases are inter-
changeable. Given the same data, KairosDB should deliver the same response,
and this result gives confidence that it does^2.
5.6 Evaluation
When comparing storage efficiency (Section 5.1), representative data sets showed
that storage efficiency varies heavily between use cases, and so does relative stor-
age efficiency. It showed that the results of the non-representative benchmark can
not be generalised to relative storage efficiency in representative workloads. Tag
value cardinality and data point value variation were identified as possible param-
eters that have a high impact on storage efficiency. Real world data usually has
low variation, while non-representative benchmarks often use random values (high
variation). These non-representative benchmarks may become more representative
of real world use cases through the use of random walks instead, which have lower
variation and more closely model real world data.
The use of representative data sets and query workloads for ingestion speed testing
(Section 5.2) showed performance problems when ingesting the complex ratings
data set, especially for InfluxDB.
In the load testing benchmark (Section 5.3), it was discovered that OpenTSDB
2 Some individual queries were compared to further confirm that KairosDB with either
Cassandra or ScyllaDB gives the same response for the same query.
Figure 5.7: Mean size in bytes of the TSDB response.
performed well for the baseline and IoT query workloads, but not for the financial
and ratings query workloads.
For the response latency (Section 5.4), the use of representative benchmarks again
showed a performance degradation for OpenTSDB for the financial and ratings
query workloads, which use broad time ranges. Otherwise, the baseline is a good
predictor for relative performance in the representative benchmarks.
When testing the mean response size (Section 5.5), the encoding of timestamps was
shown to be the deciding factor when it comes to query workloads which return a
large response.
These results make it clear that representative data set workloads and query work-
loads may lead to important differences in benchmark results. They shed doubt
on the real world applicability of benchmarks using random or synthetic data sets
and/or non-representative query workloads.
The fact that not all representative workloads show a performance impact (e.g.
only the ratings workload showed the performance degradation for InfluxDB in
the data ingestion test) highlights the importance of using multiple representative
workloads: just one representative workload may not be enough to highlight
possible deviations or performance degradations. It is impractical to create a
workload for every use case, but it is possible to generalize workloads into
categories (e.g. volume, tag value cardinality, data type, ...). Further testing is
needed to confirm that data sets with the same workload parameters will yield
comparable results.
Chapter 6
Conclusions and future work
6.1 Conclusions
Compared to a baseline non-representative workload, representative workloads
showed significant performance differences when it came to storage efficiency, data
ingestion speed for complex data, latency and maximum request rate (when broad
time ranges are used). Storage consumption per data point is lower for data sets
with low tag value cardinality and low variation. Non-representative benchmarks
using random data will have high variation, while real world data often displays
low variation. Using
random walks instead of random values may make a benchmark more represen-
tative. Data ingestion throughput testing highlighted performance problems for
data sets with large data points and high tag cardinality. Latency and load testing
showed that some databases perform significantly worse when they need to scan
a large amount of data. This illustrates the importance of using representative
workloads.
A number of TSDB benchmarks have been studied, but none of them use repre-
sentative workloads. Three existing TSDB benchmarks use nearly representative
workloads, but none of them use real world data sets. Instead, they use random
or synthesized data. Considering that my benchmark, which uses representative,
real world workloads, sheds a different light on TSDB performance, the relevance
of these existing benchmarks may be called into question.
While representative workloads uncovered significant performance differences com-
pared to non-representative workloads, it is impractical to create or test represen-
tative workloads for every use case imaginable. However, TSDB workloads can be
categorized with workload parameters (number of metrics, regularity, volume, data
type, number of tags, tag value data type, tag value cardinality, variation). Fur-
ther research is needed to determine if these parameters are enough to accurately
describe a TSDB workload and thus generalize results of one workload to another
with the same workload parameters.
Benchmarking TSDBs is a complex endeavour due to the absence of standardized
query languages, data models, or capabilities (such as aggregators or functions).
The proliferation of TSDB models has the advantage of specialisation: instead
of optimizing for the general case, individual TSDBs may seek to specialise in a
niche, e.g. geo-spatial data querying, nanosecond timestamp resolution, or real-
time streaming queries. The disadvantage is that it is much harder to compare
different TSDBs. The varying support for operations makes it so that not all
TSDBs can be compared to each other, semantic differences in query languages
require careful comparison of results to ensure they are valid, and different database
interfacing methods may lead to more difficult interpretation of benchmark results.
6.2 Future work
This dissertation has proven the relevance of representative benchmarks. The
experiments and tests that were run for this dissertation took a lot of time to
prepare and execute, and therefore, a lot of extensions have been left for the
future. Several possible lines of research could be pursued:
• The hypothesis that workloads with the same data set characteristics yield
comparable benchmark results could be tested. Analysis might produce an-
other, non-obvious workload parameter.
• The benchmark described in this dissertation can be extended to use more
TSDBs. Currently, four TSDBs are tested, but more can be added. Another
approach would be to extend another existing TSDB benchmark to be more
representative.
• The query workload could be extended to include data mutations (such as
create, update and delete queries). Benchmarks using this query workload
might produce even more representative results. However, query spread
should be carefully studied: for most query workloads, create queries will
heavily outnumber update and delete queries.
• A comparison of TSDB query languages might yield interesting results on
their construction and capabilities. Perhaps a unifying query language could
be constructed, which would facilitate research into different TSDB families.
• In production environments, TSDBs are often used in multi-node setups.
This scalability aspect is only addressed in one existing benchmark. The
benchmark in this dissertation could be extended to test clustered TSDBs.
• This dissertation has focused on TSDBs, a specialized type of database.
Representative benchmarking could be studied in different domains as well,
such as relational databases and specialized non-relational databases (such
as graph, triple or document stores).
Appendix A
Detailed results
This appendix lists detailed results discussed and displayed graphically in Chap-
ter 5.
A.1 Data ingestion throughput
Table A.1 lists the detailed results for Section 5.2.
InfluxDB OpenTSDB KairosDB KairosDB
Data set Cassandra ScyllaDB
Baseline 481818 89360 54792 59231
IoT 317999 162473 87413 98736
Financial 156498 86578 78198 82535
Ratings 4342 21196 29913 NA
Table A.1: Data ingestion speed in points per second.
A.2 Storage efficiency
Table A.2 lists the detailed results for Section 5.1.
Data set    CSV   InfluxDB   OpenTSDB   KairosDB (Cassandra)   KairosDB (ScyllaDB)
Baseline 1.0 1.4443 2.4213 2.6983 18.7948
IoT 1.0 3.3308 0.2585 2.2353 31.1704
Financial 1.0 5.8211 1.5668 6.2073 36.5115
Ratings 1.0 7.2624 0.891 3.2897 NA
Table A.2: Storage efficiency per data point, relative to the CSV source format.
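The relative figures in this table can be derived from raw on-disk sizes as in the following sketch. The byte counts shown are made-up placeholders, not measurements from this work:

```python
def relative_storage_efficiency(db_bytes, csv_bytes):
    """On-disk size of a data set in a TSDB relative to its CSV source.

    A value above 1.0 means the TSDB stores the data set less compactly
    than plain CSV; a value below 1.0 means it compresses better than CSV.
    """
    if csv_bytes <= 0:
        raise ValueError("CSV baseline size must be positive")
    return db_bytes / csv_bytes

# Made-up on-disk sizes (bytes) for a single data set:
sizes = {"CSV": 10_000_000, "InfluxDB": 14_443_000, "OpenTSDB": 24_213_000}
ratios = {db: relative_storage_efficiency(b, sizes["CSV"])
          for db, b in sizes.items()}
print(ratios)  # the CSV entry is 1.0 by construction
```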
A.3 Load testing
Table A.3 lists the detailed results for Section 5.3. Tests were performed using ten
requests in parallel, with a thirty-second timeout.
Data set    InfluxDB   OpenTSDB   KairosDB (Cassandra)   KairosDB (ScyllaDB)
Baseline 6400.36 347.367 12.8 15.4667
IoT 997.567 117.3 29.4333 40.1667
Financial 235.733 15.5664 26.8332 28.5667
Ratings 78.3333 2.13333 NA NA
Table A.3: Maximum requests per second performed using representative queries.
A.4 Response latency
Table A.4 shows the mean latency and Table A.5 the 95th percentile latency for
TSDB responses. Table A.6 shows the number of timeouts that occurred during the
latency and response size tests. These results are discussed in Section 5.4.
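The mean and 95th-percentile figures in these tables can be computed from raw latency samples as in the sketch below. The samples are made-up values (assumed here to be in milliseconds), and the percentile method shown is one common convention, not necessarily the one used by the benchmark's load-testing tooling:

```python
import statistics

def latency_summary(samples_ms):
    """Return (mean, 95th percentile) of request latency samples."""
    mean = statistics.fmean(samples_ms)
    # quantiles() with n=100 yields the 1st..99th percentile cut points;
    # index 94 is the 95th percentile. "inclusive" treats the samples as
    # the whole population rather than a sample drawn from one.
    p95 = statistics.quantiles(samples_ms, n=100, method="inclusive")[94]
    return mean, p95

# Made-up latency samples in milliseconds (not measured values):
samples = [1.2, 1.3, 1.1, 1.4, 25.0, 1.2, 1.3, 1.5, 1.2, 1.3]
mean_ms, p95_ms = latency_summary(samples)
```

A single slow outlier, as in the sample list above, barely moves the 95th percentile of a large run but can dominate the mean, which is why both statistics are reported.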
A.5 Mean response size
Table A.7 lists the detailed results for Section 5.5.
Data set    InfluxDB   OpenTSDB   KairosDB (Cassandra)   KairosDB (ScyllaDB)
Baseline 1.266 12.559 862.643 230.125
IoT 7.636 18.293 155.91 74.49
Financial 57.88 2681.441 106.85 122.25
Ratings 104.41 2563.02 NA NA
Table A.4: TSDB mean request latency for representative queries.
Data set    InfluxDB   OpenTSDB   KairosDB (Cassandra)   KairosDB (ScyllaDB)
Baseline 1.399121 12.8529 124.640305 70.532287
IoT 22.049411 44.926242 333.99535 193.673539
Financial 87.277173 3991.057059 133.174898 127.224758
Ratings 462.786983 2786.607644 NA NA
Table A.5: TSDB 95th percentile request latency for representative queries.
Data set    InfluxDB   OpenTSDB   KairosDB (Cassandra)   KairosDB (ScyllaDB)
Baseline 0 0 27 4
IoT 0 0 0 0
Financial 0 0 0 0
Ratings 0 0 NA NA
Table A.6: Number of timeouts during the latency and response size tests.
Data set    InfluxDB   OpenTSDB   KairosDB (Cassandra)   KairosDB (ScyllaDB)
Baseline 185.0 126.0 202.0 202.0
IoT 507.35 350.45 459.4 459.4
Financial 33186.1 28854.35 23117.75 23117.75
Ratings 1250.35 390.65 NA NA
Table A.7: TSDB mean response size for representative queries.
Bibliography
[1] E. F. Codd. A Relational Model of Data for Large Shared Data Banks.
Commun. ACM, 13(6):377–387, June 1970.
[2] Andrew Pavlo and Matthew Aslett. What’s Really New with NewSQL? SIG-
MOD Rec., 45(2):45–55, September 2016.
[3] Katarina Grolinger, Wilson A. Higashino, Abhinav Tiwari, and Miriam AM
Capretz. Data management in cloud environments: NoSQL and NewSQL data
stores. Journal of Cloud Computing: Advances, Systems and Applications,
2(1):22, December 2013.
[4] Rick Cattell. Scalable SQL and NoSQL data stores. ACM SIGMOD Record,
39(4):12, May 2011.
[5] Andreas Bader, Oliver Kopp, and Michael Falkenthal. Survey and Comparison
of Open Source Time Series Databases. Gesellschaft für Informatik e.V., 2017.
[6] Yueguo Chen. TS-Benchmark: A benchmark for time series databases. http:
//prof.ict.ac.cn/Bench18/chenyueguo.pdf, June 2018.
[7] Rui Liu and Jun Yuan. Benchmark Time Series Database with IoTDB-
Benchmark for IoT Scenarios. arXiv:1901.08304 [cs], January 2019.
[8] Kaippallimalil J. Jacob and Dennis Shasha. FinTime: A financial time series
benchmark. SIGMOD Record, 28:42–48, 1999.
[9] Hoang Anh Dau, Eamonn Keogh, Kaveh Kamgar, Chin-Chia Michael Yeh,
Yan Zhu, Shaghayegh Gharghabi, Chotirat Ann Ratanamahatana, Yanping,
Bing Hu, Nurjahan Begum, Anthony Bagnall, Abdullah Mueen, and Gustavo
Batista. The UCR Time Series Classification Archive. October 2018. https:
//www.cs.ucr.edu/~eamonn/time_series_data_2018/.
[10] Dheeru Dua and Casey Graff. UCI Machine Learning Repository. http://
archive.ics.uci.edu/ml, 2017.
[11] R.J. Hyndman. Time Series Data Library. https://datamarket.com/data/list/
?q=provider:tsdl.
[12] Time-series data on data.world: 34 datasets. https://data.world/datasets/
time-series.
[13] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford Large Network
Dataset Collection. June 2014.
[14] Davis Blalock, Samuel Madden, and John Guttag. Sprintz: Time Series Com-
pression for the Internet of Things. Proc. ACM Interact. Mob. Wearable Ubiq-
uitous Technol., 2(3):93:1–93:23, September 2018.
[15] Robert Allen. Case Study: How Houghton Mifflin Harcourt gets real-time
views into their AWS spend with InfluxData, October 2017.
[16] Adam Wegrzynek. Towards the integrated ALICE Online-Offline monitor-
ing subsystem. https://indico.cern.ch/event/587955/contributions/2937431/
attachments/1678739/2706702/CHEP-2018.pdf, September 2018.
[17] Mario Luca Bernardi, Marta Cimitile, Fabio Martinelli, and Francesco Mer-
caldo. A Time Series Classification Approach to Game Bot Detection. In
Proceedings of the 7th International Conference on Web Intelligence, Mining
and Semantics, WIMS ’17, pages 6:1–6:11, New York, NY, USA, 2017. ACM.
[18] Yanpei Chen, Francois Raab, and Randy Katz. From TPC-C to Big Data
Benchmarks: A Functional Workload Model. In Tilmann Rabl, Meikel Poess,
Chaitanya Baru, and Hans-Arno Jacobsen, editors, Specifying Big Data
Benchmarks, volume 8163, pages 28–43. Springer Berlin Heidelberg, Berlin,
Heidelberg, 2014.
[19] DB-Engines Ranking per database model category. https://db-engines.com/
en/ranking_categories.
[20] Y. C. Tay. Data Generation for Application-Specific Benchmarking. VLDB,
Challenges and Visions, 7:4, 2011.
[21] Xiaolan Zhang and Margo Seltzer. Application-Specific Benchmarking.
Harvard University, 2001.
[22] Ajay Joshi, Lieven Eeckhout, and Lizy John. The Return of Synthetic Bench-
marks. In 2008 SPEC Benchmark Workshop, pages 1–11, 2008.
[23] A. Chakraborti, M. Patriarca, and M. S. Santhanam. Financial time-series
analysis: A brief overview. arXiv:0704.1738 [physics, q-fin], pages 51–67,
2007.
[24] Boris Marjanovic. Huge Stock Market Dataset. https://kaggle.com/
borismarjanovic/price-volume-data-for-all-us-stocks-etfs.
[25] F. Maxwell Harper and Joseph A. Konstan. The MovieLens Datasets: History
and Context. ACM Trans. Interact. Intell. Syst., 5(4):19:1–19:19, December
2015.
[26] Code for comparison write-ups of InfluxDB and other solutions:
influxdata/influxdb-comparisons. InfluxData, May 2019.
[27] Tomas Senart. HTTP load testing tool and library. tsenart/vegeta. https:
//github.com/tsenart/vegeta, May 2019.
[28] Jef Poskanzer. http_load. https://acme.com/software/http_load/.
[29] NoSQL data store using the seastar framework, compatible with Apache Cas-
sandra: Scylladb/scylla. https://github.com/scylladb/scylla, May 2019.
List of Abbreviations
ACID Atomicity, Consistency, Isolation, Durability
API Application Programming Interface
ARIMA AutoRegressive Integrated Moving Average
CAP Consistency, Availability and Partition Tolerance
CERN European Organization for Nuclear Research
CPU Central Processing Unit
CRUD Create, Read, Update and Delete
CSV Comma-separated values
CTSDB Cloud Time Series Database
DBMS Database Management System
ETF Exchange-Traded Fund
HTTP HyperText Transfer Protocol
IMDb Internet Movie Database
IEEE Institute of Electrical and Electronics Engineers
IoT Internet of Things
JSON JavaScript Object Notation
KPI Key Performance Indicator
MIT Massachusetts Institute of Technology
NoSQL Not Only SQL
OLAP Online Analytical Processing
RAM Random Access Memory
RDBMS Relational Database Management System
REST Representational State Transfer
SNAP Stanford Network Analysis Project
SQL Structured Query Language
STAC Securities Technology Analysis Center
TPC Transaction Processing Performance Council
TS Time Series
TSDB Time Series Database
TSDL Time Series Data Library
UCI University of California, Irvine
UDP User Datagram Protocol
URL Uniform Resource Locator
USD United States Dollar
YCSB Yahoo! Cloud Serving Benchmark
List of Figures
5.1 Storage efficiency of different TSDBs in bytes per data point. Data
points contain a timestamp, a value, and may contain tags, depend-
ing on the data set. . . . . . . . . . . . . . . . . . . . . . . . . . . . 37
5.2 Relative storage efficiency of different TSDBs per data point com-
pared to the CSV source format. . . . . . . . . . . . . . . . . . . . 37
5.3 Data points ingested per second. Data sets used were one million
data points each. . . . . . . . . . . . . . . . . . . . . . . . . . . . . 41
5.4 Maximum requests per second. Tests were performed on data sets
one million data points in size. . . . . . . . . . . . . . . . . . . . . . 42
5.5 Mean latency per request. . . . . . . . . . . . . . . . . . . . . . . . 44
5.6 95th percentile of latency per request. . . . . . . . . . . . . . . . . . 44
5.7 Mean size in bytes of the TSDB response. . . . . . . . . . . . . . . 46
List of Tables
3.1 Evaluation of existing TSDB benchmarks . . . . . . . . . . . . . . . 20
4.1 Overview of workload data sets . . . . . . . . . . . . . . . . . . . . 30
A.1 Data ingestion speed in points per second. . . . . . . . . . . . . 51
A.2 Storage efficiency per data point, relative to the CSV source format. 52
A.3 Maximum requests per second performed using representative queries. 52
A.4 TSDB mean request latency for representative queries. . . . . . . 53
A.5 TSDB 95th percentile request latency for representative queries. . 53
A.6 Number of timeouts during the latency and response size tests. . . . 53
A.7 TSDB mean response size for representative queries. . . . . . . . . . 53