Evaluating NoSQL Performance: Which Database Is Right for Your Data? - Sergey Sverchkov (Altoros)
DESCRIPTION
Presented at JAX London 2013. The need to operate terabyte-size databases has become very common. Unless you have implemented architectures that use NoSQL databases and frameworks that support data-intensive distributed applications, many of the available technology options are probably something of an enigma. This session focuses on real-world, successful attempts to benchmark four of the most popular NoSQL databases side by side. The base tool selected for this research is the Yahoo! Cloud Serving Benchmark (YCSB), and benchmarking is performed on Amazon Elastic Compute Cloud (EC2) instances.
TRANSCRIPT
© ALTOROS Systems | CONFIDENTIAL
Evaluating NoSQL Performance:
Which Database Is Right for Your Data?
Sergey Sverchkov
Project Manager
Relational databases are great… But
Problem: Complex object graphs
  Object/relational impedance mismatch
  It is complicated to map a rich domain model to a relational schema
  Performance issues
Problem: Schema evolution
  Adding attributes to an object => have to add columns to the table
  Expensive if the table holds a lot of data
Problem: Scaling
  Scaling writes is difficult, expensive, or impossible => big data
  Vertical scaling is limited and expensive
  Horizontal scaling is limited and expensive
Relational Databases
Tables: ORDER, ADDRESS, CUSTOMER, ORDER_LINES
Order
  ID: 1001
  Order Date: 15.9.2012
  Customer
    First Name: Peter
    Last Name: Sample
  Billing Address
    Street: Somestreet 10
    City: Somewhere
    Postal Code: 55901
  Line Items
    Name          Quantity  Price
    Ipod Touch    1         220.95
    Monster Beat  2         190.00
    Apple Mouse   1         69.90
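To make the mismatch concrete, here is a minimal sketch of the same order held as a single nested document, the shape a document store would keep, rather than as rows spread over four tables. The key names, the `order_total` helper, and the `loyalty_tier` attribute are illustrative assumptions, and the sketch treats Price as a per-unit price, which the slide does not state.

```python
from decimal import Decimal

# The order from the slide as one nested document instead of rows in
# ORDER, CUSTOMER, ADDRESS and ORDER_LINES. Key names are assumptions.
order = {
    "id": 1001,
    "order_date": "15.9.2012",
    "customer": {"first_name": "Peter", "last_name": "Sample"},
    "billing_address": {
        "street": "Somestreet 10",
        "city": "Somewhere",
        "postal_code": "55901",
    },
    "line_items": [
        {"name": "Ipod Touch", "quantity": 1, "price": Decimal("220.95")},
        {"name": "Monster Beat", "quantity": 2, "price": Decimal("190.00")},
        {"name": "Apple Mouse", "quantity": 1, "price": Decimal("69.90")},
    ],
}


def order_total(doc):
    """Sum quantity * price over the embedded line items.

    Assumes "price" is a unit price; the slide does not say.
    """
    return sum(
        (item["quantity"] * item["price"] for item in doc["line_items"]),
        Decimal("0"),
    )


# Schema evolution: a new attribute is just a new key on this document.
# No column has to be added to a table that already holds many rows.
order["customer"]["loyalty_tier"] = "gold"  # hypothetical attribute

total = order_total(order)
```

Reading the whole order is one lookup by `id` here, whereas the relational layout would need joins across the four tables.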
Why evaluate
• There is a wide variety of NoSQL databases: 150+ in 2013
• Different NoSQL database types exist: key-value, columnar, document, and graph
• NoSQL DBs don’t use the relational data model and don’t use SQL
• They are schema-free, with a flexible data model
• They have different APIs
• Some NoSQL data stores support certain SQL notions
• Many operate with eventual consistency
• NoSQL DBs tend to be designed to run on a cluster
• They support horizontal scaling (scaling out)
Overview of the NoSQL ecosystem:
Evaluation criteria:
• Data model: key-value, document, column family, or graph
• Query capabilities: REST API, query language, or MapReduce support
• Concurrency control: optimistic locking or multi-version concurrency control
• Partitioning: range or hash
• Consistency and replication: whether availability or consistency is favored
• Performance: typical workloads
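The partitioning criterion above can be illustrated with a small sketch that contrasts hash and range placement of keys on a four-node cluster. The node names, the split points, and the use of SHA-256 are assumptions made for illustration; real stores use their own partitioners.

```python
import bisect
import hashlib

NODES = ["node0", "node1", "node2", "node3"]  # hypothetical node names


def hash_partition(key, nodes=NODES):
    """Place a key by a stable hash of its bytes.

    SHA-256 is only a stand-in here, not what any particular store uses.
    """
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return nodes[int.from_bytes(digest[:8], "big") % len(nodes)]


# Hypothetical split points: keys below "user25" go to node0, keys from
# "user25" up to (not including) "user50" go to node1, and so on.
RANGE_SPLITS = ["user25", "user50", "user75"]


def range_partition(key, splits=RANGE_SPLITS, nodes=NODES):
    """Place a key by where it sorts among the split points."""
    return nodes[bisect.bisect_right(splits, key)]


hash_node = hash_partition("user234123")
range_node = range_partition("user234123")
```

Range partitioning keeps neighbouring keys on the same node, which suits ordered scans, while hash partitioning spreads keys evenly at the cost of scattering a key range across nodes.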
How to evaluate NoSQL data stores
Performance evaluation approach: definitions
• Yahoo! Cloud Serving Benchmark (YCSB)
  a framework with a workload generator
  a set of workload scenarios
• A workload is defined by distributions that determine
  which operation to perform
  which record to read or write
• Operations are of the following types:
  Insert: Inserts a new record.
  Update: Updates a record by replacing the value of one field.
  Read: Reads a record, either one randomly selected field or all fields.
  Scan: Scans records in order, starting from a randomly selected record key.
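The two choices a workload makes for every request, which operation and which record, can be sketched in a few lines. This is not YCSB's own code: the function names are hypothetical, and only a uniform key distribution is shown, although YCSB also offers skewed ones.

```python
import random


def make_operation_chooser(proportions, rng):
    """Return a function that picks an operation name by its proportion."""
    ops = list(proportions)
    weights = [proportions[op] for op in ops]

    def choose():
        return rng.choices(ops, weights=weights, k=1)[0]

    return choose


def uniform_key(record_count, rng):
    """Pick a record key uniformly at random among the loaded records."""
    return "user%d" % rng.randrange(record_count)


# Example: a read-mostly mix of 95% reads and 5% updates.
rng = random.Random(42)
choose_op = make_operation_chooser({"read": 0.95, "update": 0.05}, rng)
requests = [(choose_op(), uniform_key(1000, rng)) for _ in range(5)]
```

Each entry in `requests` is an (operation, key) pair that a client thread would then send to the database under test.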
Performance evaluation approach - definitions
• Table of 100,000,000 records
  Each record is 1,000 bytes in size and contains 10 fields
  Fields are named field0, field1, ..., field9
  A primary key such as “user234123” identifies each record
  Values in each field are random strings of ASCII characters, 100 bytes each
• Workload executor
  Runs multiple client threads
  Each thread executes a sequential series of operations
  The load phase inserts the records
  The transaction phase runs the workload operations against them
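A record of the shape described above can be generated as follows. This is a minimal sketch, not the benchmark's generator: `build_record` is a hypothetical helper, and the alphabet of letters and digits is an assumption about what "random ASCII" means here.

```python
import random
import string

FIELD_COUNT = 10     # fields per record
FIELD_LENGTH = 100   # bytes per field value


def build_record(key_number, rng):
    """Build one record: a "user<N>" key plus ten 100-character fields."""
    key = "user%d" % key_number
    alphabet = string.ascii_letters + string.digits  # assumed character set
    fields = {
        "field%d" % i: "".join(rng.choices(alphabet, k=FIELD_LENGTH))
        for i in range(FIELD_COUNT)  # ten fields, numbered from zero
    }
    return key, fields


key, fields = build_record(234123, random.Random(0))
payload_bytes = sum(len(value) for value in fields.values())
```

The payload adds up to 1,000 bytes per record, excluding the key and any storage overhead the database adds.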
Performance evaluation approach – component diagram
Testing environment diagram
Where to evaluate
Performance evaluation – environment specification
• Amazon AWS EC2 instances:
  Single availability zone eu-west-1b, Ireland region
  Single security group with all required ports open
  4 m1.xlarge 64-bit instances for cluster nodes: 16 GB RAM, 4 vCPU, 8 ECU, high-performance network
  1 c1.xlarge 64-bit instance for the YCSB client: 7 GB RAM, 8 vCPU, 20 ECU, high-performance network
  2 additional c1.medium 64-bit instances for mongo routers: 1.7 GB RAM, 2 vCPU, 5 ECU, moderate network
• Storage for each NoSQL cluster node:
  4 EBS volumes of 25 GB each in RAID 0
  EBS-optimized volumes, no Provisioned IOPS
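A quick sizing check helps when reading the results. The sketch below combines the data set of 100,000,000 records of 1,000 bytes with the cluster figures above. It counts raw field data only and assumes an even spread across nodes; keys, storage overhead, and compression are ignored, so the numbers are rough.

```python
RECORD_COUNT = 100_000_000
RECORD_BYTES = 1_000
NODES = 4
RAM_GB_PER_NODE = 16      # as stated on the slide
VOLUMES_PER_NODE = 4
VOLUME_GB = 25

# Raw data set size in decimal gigabytes (10^9 bytes).
dataset_gb = RECORD_COUNT * RECORD_BYTES / 1e9

# Share per node, assuming an even spread and replication factor 1.
per_node_gb = dataset_gb / NODES

# RAID 0 stripes the volumes, so capacity per node is their sum.
raid0_gb_per_node = VOLUMES_PER_NODE * VOLUME_GB

total_ram_gb = NODES * RAM_GB_PER_NODE
fits_in_ram = dataset_gb <= total_ram_gb
```

Under these assumptions the raw data set is larger than the combined RAM of the four nodes, so at least part of the read traffic has to be served from disk.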
Databases to evaluate
• Cassandra 2.0, settings for each cluster node
  partitioner: org.apache.cassandra.dht.Murmur3Partitioner
  key_cache_size_in_mb: 1024
  row_cache_size_in_mb: 6096
  JVM heap size: 6 GB
  Snappy compressor
  Replication factor 1
• MongoDB 2.4.6
  2 c1.medium nodes running the mongo router process (mongos)
  Replication factor 1
  Sharding by the internal key “_id”
• Couchbase 2.1
  Replication factor 1
  Memory + disk mode
• HBase 0.92, settings for HRegionServer
  JVM heap size: 12 GB
  Replication factor 1
  Snappy compressor
Workloads
Performance of the systems was evaluated under different workloads:
Workload A: Update heavy, read/update ratio 50/50
Workload B: Read mostly, read/update ratio 95/5
Workload C: Read only, 100% read
Workload D: Read latest, read/insert ratio 95/5
Workload F: Read-modify-write, read-modify-write/read ratio 50/50
Workload G: Write heavy, read/insert ratio 10/90
Workload definition parameters:
fieldcount=10
fieldlength=100
threadcount=100
operationcount=10000
recordcount=100000000
workload=com.yahoo.ycsb.workloads.CoreWorkload
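The workload mixes and the shared parameters listed above can be combined into one properties text per workload. The sketch below uses the proportion property names of YCSB's CoreWorkload (`readproportion`, `updateproportion`, `insertproportion`, `scanproportion`, `readmodifywriteproportion`). The `render_properties` helper is hypothetical, and the request distributions used in the original tests are not stated on the slide, so none is set here.

```python
COMMON = {
    "workload": "com.yahoo.ycsb.workloads.CoreWorkload",
    "recordcount": 100000000,
    "operationcount": 10000,
    "threadcount": 100,
    "fieldcount": 10,
    "fieldlength": 100,
}

# Operation mixes taken from the workload list on the slide.
MIXES = {
    "A": {"readproportion": 0.5, "updateproportion": 0.5},
    "B": {"readproportion": 0.95, "updateproportion": 0.05},
    "C": {"readproportion": 1.0},
    "D": {"readproportion": 0.95, "insertproportion": 0.05},
    "F": {"readproportion": 0.5, "readmodifywriteproportion": 0.5},
    "G": {"readproportion": 0.1, "insertproportion": 0.9},
}

PROPORTION_KEYS = (
    "readproportion",
    "updateproportion",
    "insertproportion",
    "scanproportion",
    "readmodifywriteproportion",
)


def render_properties(name):
    """Render one workload as key=value lines, one property per line."""
    mix = MIXES[name]
    if abs(sum(mix.values()) - 1.0) > 1e-9:
        raise ValueError("proportions for workload %s must sum to 1" % name)
    lines = ["%s=%s" % (key, value) for key, value in COMMON.items()]
    # Write unused proportions as 0 explicitly, since the defaults for
    # unspecified proportions may be non-zero.
    for key in PROPORTION_KEYS:
        lines.append("%s=%s" % (key, mix.get(key, 0)))
    return "\n".join(lines)


workload_g = render_properties("G")
```

Keeping the mixes in one table makes it easy to confirm that every workload's proportions sum to one before a long benchmark run starts.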
Load phase, average latency vs. throughput
[Chart] Load phase, 100,000,000 records * 1 KB, [INSERT]. Axes: throughput (ops/sec) and average latency (ms). Series: hbase, cassandra, couchbase, mongodb.
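The two quantities plotted in these charts can be derived from per-operation timings. The sketch below is an illustration, not the benchmark's own reporting code. The closed-loop estimate assumes that each client thread issues one request at a time and that client-side overhead is negligible, so it is an upper bound rather than a prediction.

```python
def summarize(latencies_ms, wall_clock_seconds):
    """Compute throughput and average latency for one benchmark run."""
    ops = len(latencies_ms)
    return {
        "throughput_ops_sec": ops / wall_clock_seconds,
        "average_latency_ms": sum(latencies_ms) / ops,
    }


def closed_loop_throughput(threads, average_latency_ms):
    """Estimate peak ops/sec when each thread waits for its own reply.

    Each thread completes 1000 / latency_ms operations per second, so
    the whole client completes threads * 1000 / latency_ms.
    """
    return threads * 1000.0 / average_latency_ms


run = summarize([2.0, 4.0, 6.0, 8.0], wall_clock_seconds=2.0)
ceiling = closed_loop_throughput(threads=100, average_latency_ms=5.0)
```

With 100 client threads, as in the workload parameters, an average latency of 5 ms caps the client at 20,000 ops/sec under these assumptions, which is why latency and throughput are best read together.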
Workload A – 50% update operations
[Chart] Workload A: Update (Update 50%, Read 50%). Series: cassandra, couchbase, hbase, mongodb.
Workload A – 50% read operations
[Chart] Workload A: Read (Update 50%, Read 50%). Series: cassandra, couch, hbase, mongo.
Workload B – 5% update operations
[Chart] Workload B: Update (update 5%, read 95%). Series: cassandra, couch, hbase, mongo.
Workload B – 95% read operations
[Chart] Workload B: Read (update 5%, read 95%). Series: cassandra, couch, hbase, mongo.
Workload C – 100% read operations
[Chart] Workload C: 100% Read. Series: cassandra, couch, hbase, mongo.
Workload D – 5% insert operations
[Chart] Workload D: Insert (insert 5%, read 95%). Series: cassandra, couch, hbase, mongo.
Workload D – 95% read operations
[Chart] Workload D: Read (insert 5%, read 95%). Series: cassandra, couch, hbase, mongo.
Workload E – 95% scan operations
[Chart] Workload E: Insert (Insert 5%, Scan 95%). Series: cassandra, hbase.
Workload F – 50% Read operations
[Chart] Workload F: Read (Read-Modify-Write 50%, Read 50%). Series: cassandra, couch, hbase, mongo.
Workload F – Update part of Read-Modify-Write
[Chart] Workload F: Update (Read-Modify-Write 50%, Read 50%). Series: cassandra, couch, hbase, mongo.
Workload F – 50% Read-Modify-Write operations
[Chart] Workload F: Read-Modify-Write (Read-Modify-Write 50%, Read 50%). Series: cassandra, couch, hbase, mongo.
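A read-modify-write operation reads a record, changes it on the client, and writes it back. The sketch below shows that cycle against a tiny in-memory store. The version check is an added assumption that illustrates the optimistic locking named in the evaluation criteria; the benchmark operation itself may simply read and then update. `VersionedStore` and `read_modify_write` are hypothetical names.

```python
class VersionedStore:
    """In-memory stand-in for a database that versions each record."""

    def __init__(self):
        self._data = {}  # key -> (version, fields)

    def insert(self, key, fields):
        self._data[key] = (1, dict(fields))

    def read(self, key):
        version, fields = self._data[key]
        return version, dict(fields)

    def update_if_version(self, key, expected_version, fields):
        """Write only if nobody changed the record since it was read."""
        version, _ = self._data[key]
        if version != expected_version:
            return False
        self._data[key] = (version + 1, dict(fields))
        return True


def read_modify_write(store, key, modify, max_retries=3):
    """Read a record, apply modify(fields), write back; retry on conflict."""
    for _ in range(max_retries):
        version, fields = store.read(key)
        if store.update_if_version(key, version, modify(fields)):
            return True
    return False


store = VersionedStore()
store.insert("user1", {"field0": "old"})
ok = read_modify_write(store, "user1", lambda f: {**f, "field0": "new"})
```

The operation costs at least one read plus one write per attempt, so its latency is expected to sit above that of a plain read or a plain update.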
Workload G – 90% Insert operations
[Chart] Workload G: Insert (Insert 90%, Read 10%). Series: cassandra, couch, hbase, mongo.
Workload G – 10% Read operations
[Chart] Workload G: Read (Insert 90%, Read 10%). Series: cassandra, couch, hbase, mongo.
Choose a solution based on your needs:
• Identify typical application operations
• Identify datasets and a potential data model
• Identify transaction, replication and consistency requirements
• Identify performance requirements
• Identify how you can migrate, if needed
• Evaluate functionality and performance of chosen databases
• Build proof-of-concept for the solution
• There is no perfect NoSQL or relational database, and no “bad” one
Conclusion
Evaluating NoSQL Performance
Sergey Sverchkov
Project Manager
Altoros, 2013
Thank you