MyCassandra (full English version)
Post on 15-Jan-2015
Shunsuke Nakamura / @sunsuk7tp
Tokyo Institute of Technology Master Course
Tokyo, Japan
[Figure: update latency in a write-heavy workload and read latency in a read-heavy workload, for read-optimized and write-optimized data stores; an arrow marks the better direction.]
The storage engine determines which workload a data store treats efficiently.
The distribution architecture of a data store is independent of the performance characteristics of read and write.
For example, if the storage engine is exchanged for MySQL, how do the read and write characteristics change?
Data store         Performance      Storage engine   Distribution
Apache HBase       write-optimized  Bigtable-like    centralized
Apache Cassandra   write-optimized  Bigtable-like    decentralized
Sharded MySQL      read-optimized   MySQL            centralized
Yahoo! Sherpa      read-optimized   MySQL            centralized
What is MyCassandra?
Cassandra = Dynamo + Bigtable
• Dynamo: distribution (P2P/decentralized)
• Bigtable: storage engine
MyCassandra = Dynamo + a selectable storage engine (MySQL, Bigtable, Redis)
MyCassandra is a modular distributed data store. You can select a storage engine for each keyspace.
• Index algorithm
• Read-optimized vs. write-optimized
• Sequential or random access
• Volatile or persistent
• Your experience with the storage engine
MySQL (B+-Tree): read-optimized
Bigtable (LSM-Tree): write-optimized; Cassandra’s original storage engine
Redis (hash): in-memory, with asynchronous snapshots
MongoDB (B-Tree): schema-less, document-oriented DB
KyotoCabinet (hash/B+-Tree): simple pluggable DBM (extended TokyoCabinet)
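As a rough illustration of choosing an engine per keyspace, the sketch below maps workload labels to the engine profiles listed above. The keyspace names, the workload labels, and the `choose_engine` helper are all hypothetical; MyCassandra's real configuration mechanism is not shown in these slides.

```python
# Hypothetical sketch: pick a storage engine for each keyspace by workload.
# Engine profiles follow the list above; everything else is made up.
ENGINE_PROFILES = {
    "Bigtable": "write-optimized",
    "MySQL": "read-optimized",
    "Redis": "in-memory",
}

# Assumed workload labels, not part of MyCassandra.
WORKLOAD_TO_PROFILE = {
    "write-heavy": "write-optimized",
    "read-heavy": "read-optimized",
    "volatile": "in-memory",
}


def choose_engine(workload: str) -> str:
    """Return the name of the first engine whose profile fits the workload."""
    wanted = WORKLOAD_TO_PROFILE[workload]
    for name, profile in ENGINE_PROFILES.items():
        if profile == wanted:
            return name
    raise ValueError(f"no engine for workload {workload!r}")


# One engine per keyspace (keyspace names are illustrative only).
keyspace_engines = {
    "access_logs": choose_engine("write-heavy"),
    "user_profiles": choose_engine("read-heavy"),
    "sessions": choose_engine("volatile"),
}
```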
You can adapt any data store to MyCassandra, a scalable data store.
• RDB (MySQL/PostgreSQL)
You can apply it to apps whose I/O characteristics change from phase to phase.
• MapReduce: Map – Shuffle – Reduce
• Full-text search: crawl – indexing – search
You can apply it to any IaaS environment.
• EC2 + RDS (MyCassandra with MySQL)
[Figure: maximum QPS with 40 clients for the Bigtable, MySQL, and Redis storage engines under Write Only, Write Heavy, Read Heavy, and Read Only workloads; y-axis from 0 to 40,000 qps, higher is better.]
client
• o.a.c.cli
• o.a.c.avro/thrift
proxy
• o.a.c.service.StorageProxy
server
• o.a.c.service.StorageService
• o.a.c.db.ReadVerbHandler/RowMutationVerbHandler
engine
• o.a.c.db.Table (per keyspace)
• o.a.c.db.commitlog
• o.a.c.db.ColumnFamilyStore (per column family)
• o.a.c.db.engine.StorageEngineInterface ← added
• o.a.c.db.engine.MySQLInstance, RedisInstance, MongoDBInstance, …
Now supporting
• put (key, cf): Insert/Update/Delete
• get (key)
• getRangeSlice (startWith, endWith, maxResults)
• truncate/dropTable/dropDB
Next supporting
• secondaryIndex
• expire
• counter (Cassandra-0.8 ~)
At a minimum, you implement these two methods.
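To make the engine interface more concrete, here is a minimal sketch of what a pluggable engine could look like. It assumes the two required methods are `put` and `get`, which the slide does not state explicitly. The class names, the Python signatures, and the use of `None` to model a delete are assumptions for illustration, not the actual `StorageEngineInterface`.

```python
from abc import ABC, abstractmethod
from typing import Dict, List, Optional, Tuple


class StorageEngine(ABC):
    """Hypothetical minimal engine contract: one write call, one read call."""

    @abstractmethod
    def put(self, key: str, cf: Optional[bytes]) -> None:
        """Insert or update the serialized column family; None deletes (assumption)."""

    @abstractmethod
    def get(self, key: str) -> Optional[bytes]:
        """Return the serialized column family for the key, or None if absent."""


class InMemoryEngine(StorageEngine):
    """Toy engine backed by a dict, standing in for MySQL, Redis, and so on."""

    def __init__(self) -> None:
        self._rows: Dict[str, bytes] = {}

    def put(self, key: str, cf: Optional[bytes]) -> None:
        if cf is None:
            self._rows.pop(key, None)  # delete
        else:
            self._rows[key] = cf  # insert or update

    def get(self, key: str) -> Optional[bytes]:
        return self._rows.get(key)

    def get_range_slice(
        self, start_with: str, end_with: str, max_results: int
    ) -> List[Tuple[str, bytes]]:
        """Optional extra: rows with start_with <= key <= end_with, in key order."""
        keys = sorted(k for k in self._rows if start_with <= k <= end_with)
        return [(k, self._rows[k]) for k in keys[:max_results]]


engine = InMemoryEngine()
engine.put("sato", b"visits;18;plan;Gold")
engine.put("suzuki", b"visits;214;plan;Bronze")
```

A real engine would translate the same calls into SQL statements or Redis commands; the proxy and server layers would not need to change.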
The data model is the same as Cassandra’s.
• But super columns are not supported yet.
Data is stored in the same key/value format as an SSTable.
• This supports a NoSQL store of any data model.
For a NoSQL store whose data model has fewer dimensions than Cassandra’s:
• Add a prefix to the primary key.
• The prefix is the Keyspace/ColumnFamily name.
Cassandra       MySQL      Redis
keyspace        database   db
column family   table      record
column          field
Bigtable (Cassandra): a keyspace holding columnfamily A and columnfamily B, each made of columns (col)

key     gender  age  region
sato    male    17   [null]
suzuki  female  21   Tokyo

key     visits  plan
sato    18      Gold
suzuki  214     Bronze

RDB (MySQL): a database holding table A and table B

key     values
sato    gender;male;age;17
suzuki  gender;female;age;21;region;Tokyo

key     values
sato    visits;18;plan;Gold
suzuki  visits;214;plan;Bronze

KVS (Redis): a single db

key       values
A:sato    …
B:ito     …
A:suzuki  …
B:tanaka  …
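The mapping in the example above can be sketched in a few lines. The `;`-separated value and the `A:` style prefix follow the slide's illustration; MyCassandra's real serialized-object format is not shown here, so the function names and the exact encoding are assumptions.

```python
from typing import Dict, Optional


def prefixed_key(column_family: str, row_key: str) -> str:
    """Build a flat key for a key-value store, e.g. 'A' + 'sato' -> 'A:sato'."""
    return f"{column_family}:{row_key}"


def serialize_columns(columns: Dict[str, Optional[object]]) -> str:
    """Flatten columns to 'name;value;name;value', skipping null columns.

    Illustrative only: values that contain ';' would break this encoding.
    """
    return ";".join(
        f"{name};{value}" for name, value in columns.items() if value is not None
    )


def deserialize_columns(blob: str) -> Dict[str, str]:
    """Inverse of serialize_columns; all values come back as strings."""
    parts = blob.split(";") if blob else []
    return dict(zip(parts[0::2], parts[1::2]))


row = {"gender": "male", "age": 17, "region": None}
flat_key = prefixed_key("A", "sato")
flat_value = serialize_columns(row)
```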
A key and a value holding a serialized object (now)
↓ # can be changed easily
A column is mapped to a MySQL field
• This has smaller overhead, but a schema is needed.
Add specialized columns
• For secondary search
• For range queries
[Figure: MySQL table layout with the fields rowKey, CF, counter, secondary index, and token. Labels: primary key, serialized object, specialized column (for secondary search, for range search), Key, Value.]
A heterogeneous cluster
• It combines multiple types of nodes, each running a different storage engine.
• The replicas of each record are placed on nodes with different storage engines.
• A proxy routes each query to the nodes that process it efficiently.
[Figure: a write query and a read query, each sent to a W node (Bigtable) and an R node (MySQL); one replica is handled synchronously and the other asynchronously.]
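One way to picture the routing is a function that splits a record's replicas into those the proxy waits for and those it updates or checks in the background. This is a sketch under the assumption that writes wait on W nodes and reads wait on R nodes; the node names and the `split_replicas` helper are hypothetical.

```python
from typing import Dict, List, Tuple


def split_replicas(
    query_type: str, roles: Dict[str, str]
) -> Tuple[List[str], List[str]]:
    """Return (synchronous nodes, background nodes) for a query.

    roles maps a node name to 'W' (write-optimized) or 'R' (read-optimized).
    """
    if query_type not in ("write", "read"):
        raise ValueError(f"unknown query type {query_type!r}")
    preferred = "W" if query_type == "write" else "R"
    sync = [node for node, role in roles.items() if role == preferred]
    background = [node for node, role in roles.items() if role != preferred]
    return sync, background


# Hypothetical two-replica layout matching the figure's W and R nodes.
roles = {"node-bigtable": "W", "node-mysql": "R"}
write_plan = split_replicas("write", roles)
read_plan = split_replicas("read", roles)
```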
MyCassandra Cluster provides the same consistency strength as Cassandra.
Quorum protocol: (write agreements) + (read agreements) > (replicas)
• This protocol guarantees that a read obtains the most recent value.
Our system needs one node that synchronously processes both read and write queries.
→ Memory-based node (Redis)
[Figure: W, R, and RW nodes with write and read arrows.]
• W: write-optimized (e.g. Bigtable) • R: read-optimized (e.g. MySQL) • RW: memory-based (e.g. Redis)
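The quorum inequality can be checked mechanically. The sketch below compares the arithmetic condition with a brute-force check that every possible set of write replicas shares at least one node with every possible set of read replicas; the function names are made up for this example.

```python
from itertools import combinations


def quorum_holds(write_agreements: int, read_agreements: int, replicas: int) -> bool:
    """The condition from the slide: W + R > N."""
    return write_agreements + read_agreements > replicas


def every_read_overlaps_a_write(
    write_agreements: int, read_agreements: int, replicas: int
) -> bool:
    """Brute force: does each read set intersect each write set?"""
    nodes = range(replicas)
    return all(
        set(write_set) & set(read_set)
        for write_set in combinations(nodes, write_agreements)
        for read_set in combinations(nodes, read_agreements)
    )


# With 3 replicas and 2 agreements for both reads and writes, the sets overlap.
ok = quorum_holds(2, 2, 3)
# With 1 agreement each, a read may miss the node that took the write.
weak = quorum_holds(1, 1, 3)
```

The overlap is what lets a read see the latest write: at least one node in any read set has acknowledged the most recent write.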
1) A proxy broadcasts the query to the nodes.
2) The proxy waits for acks.
3a) Write success: the proxy returns a success message to the client.
3b) Write failure: the proxy keeps waiting for acks until their total satisfies the quorum.
4) The proxy asynchronously waits for acks from the remaining nodes.
[Figure: client, proxy, and the W, R, and RW nodes responsible for a record. The proxy waits for two acks for the write and returns to the client; the remaining write is asynchronous.]
Write latency: max (W, RW)
(replicas) = 3, (agreements) = 2, W:RW:R = 1:1:1
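The write latency formula can be reproduced with a small timing model: the proxy returns once the quorum-th ack has arrived. This is a simplified sketch, not MyCassandra's proxy code; the ack times are invented, and it assumes the W and RW nodes acknowledge writes faster than the R node.

```python
from typing import Dict, Optional


def write_return_time(ack_ms: Dict[str, Optional[float]], quorum: int) -> float:
    """Time at which the proxy can answer the client.

    ack_ms maps a node role to its ack time in milliseconds, or None if the
    write failed on that node. The proxy returns when `quorum` acks are in.
    """
    arrived = sorted(t for t in ack_ms.values() if t is not None)
    if len(arrived) < quorum:
        raise RuntimeError("quorum not reached")
    return arrived[quorum - 1]


# Hypothetical ack times: the read-optimized node is slow at writes.
normal = write_return_time({"W": 1.0, "RW": 0.5, "R": 8.0}, quorum=2)

# If the W node fails, the proxy has to wait for the slow R node instead.
degraded = write_return_time({"W": None, "RW": 0.5, "R": 8.0}, quorum=2)
```

In the normal case the result equals max (W, RW); the slow write to the R node finishes in the background.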
1) A proxy sends a data request to an R or RW node and a digest request to the other replicas.
2) The proxy waits for the replies, including the one that contains the requested record.
3a) Success: if the record and the digests are consistent, the proxy returns the record to the client.
3b) Failure or inconsistency: the proxy retries the read and collects digests until they satisfy the quorum.
4) After replying to the client, the proxy waits for the replies from the remaining nodes. If there is an inconsistency, it is resolved using Read Repair.
[Figure: client, proxy, and the W, R, and RW nodes responsible for a record. The proxy checks consistency and returns the result to the client; the remaining consistency check is asynchronous.]
Read latency: max (R, RW)
(replicas) = 3, (agreements) = 2, W:RW:R = 1:1:1
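The read steps above can be sketched as a digest comparison followed by a repair. This is an illustrative model only: replicas are a plain dict of (value, timestamp) pairs, SHA-256 stands in for whatever digest the real system uses, and `quorum_read` is a hypothetical helper.

```python
import hashlib
from typing import Dict, List, Tuple

Version = Tuple[str, int]  # (value, timestamp)


def digest(value: str) -> str:
    """Short fingerprint of a value, sent instead of the full record."""
    return hashlib.sha256(value.encode("utf-8")).hexdigest()


def quorum_read(
    replicas: Dict[str, Version], data_node: str, digest_nodes: List[str]
) -> Tuple[str, List[str]]:
    """Read from data_node, compare digests, and repair stale replicas.

    Returns the value given to the client and the nodes that were repaired.
    """
    value, _ = replicas[data_node]
    if all(digest(replicas[n][0]) == digest(value) for n in digest_nodes):
        return value, []  # step 3a: record and digests agree
    # Step 3b: on a mismatch, compare full versions and keep the newest one.
    contacted = [data_node, *digest_nodes]
    newest = max((replicas[n] for n in contacted), key=lambda version: version[1])
    stale = [n for n in contacted if replicas[n] != newest]
    for node in stale:
        replicas[node] = newest  # read repair
    return newest[0], stale


# Hypothetical state: the R node missed the latest write.
replicas = {"R": ("Gold", 1), "RW": ("Platinum", 2), "W": ("Platinum", 2)}
result = quorum_read(replicas, data_node="R", digest_nodes=["RW"])
```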
[Figure: maximum throughput (queries/sec) with 40 clients for Cassandra and MyCassandra Cluster under Write-Only [100:0], Write-Heavy [50:50], Read-Heavy [5:95], and Read-Only [0:100] workloads ([write:read]); y-axis from 0 to 20,000, higher is better. Ratio annotations on the bars: ×1.54, ×6.53, ×0.93, ×0.90.]
• YCSB / Zipfian
• Throughput was up to 6.53 times as high as that of Cassandra.
• In the Write-Heavy workload, multiple read repairs occur.
MyCassandra-0.2.2
• secondaryIndex: applied to MySQL and MongoDB
MyCassandra-0.3.0
• Based on Cassandra-0.8
• Atomic counter
• Brisk (Hadoop + Cassandra)…
1. Asynchronous deletion
2. Engine failure detection
3. Support for ad hoc queries
Cassandra’s delete/expire operation
• Logical deletion using a tombstone
• Actual deletion during SSTable compaction
→ This approach depends on the Bigtable engine.
MyCassandra (MySQL, Redis, …)
• Synchronous deletion (now)
• The expire function works, but the data continues to exist.
• Asynchronous deletion is an I/O-heavy operation on one big table, unlike an SSTable, which holds only a subset of the data.
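The tombstone approach described above can be illustrated with a toy store: a delete only writes a marker, and a later compaction pass removes the marked rows. This is a simplified sketch of the idea, not Cassandra's SSTable code; `TombstoneStore` and its methods are hypothetical.

```python
from typing import Dict, Optional

TOMBSTONE = object()  # marker meaning "this row was deleted"


class TombstoneStore:
    """Toy store with logical deletion and a separate compaction step."""

    def __init__(self) -> None:
        self._rows: Dict[str, object] = {}

    def put(self, key: str, value: str) -> None:
        self._rows[key] = value

    def delete(self, key: str) -> None:
        # Logical deletion: cheap, because nothing is removed yet.
        self._rows[key] = TOMBSTONE

    def get(self, key: str) -> Optional[object]:
        value = self._rows.get(key)
        return None if value is TOMBSTONE else value

    def compact(self) -> int:
        """Physically drop tombstoned rows; returns how many were removed."""
        dead = [k for k, v in self._rows.items() if v is TOMBSTONE]
        for key in dead:
            del self._rows[key]
        return len(dead)

    def stored_rows(self) -> int:
        """Rows still held in storage, including tombstones."""
        return len(self._rows)


store = TombstoneStore()
store.put("sato", "Gold")
store.delete("sato")
rows_before_compaction = store.stored_rows()
removed = store.compact()
```

On an engine such as MySQL the compaction step has no natural counterpart, which is why the cleanup would have to scan one large table.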
When only the storage engine fails: how to detect the failure, and how the instance should behave
With several storage engines and a partial failure: how the instance should behave
[Figure: an instance detects a failure of its engine. What should it do? Treat it as a failure of the whole instance? Have another node take over? With periodic polling of the engine by the instance, an engine failure is reported as node down.]
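A periodic polling check could look roughly like the sketch below: each engine exposes a ping callable, and the instance classifies itself from the results. The slides leave the policy open, so the "up", "partial", and "down" states and both function names are assumptions for illustration.

```python
from typing import Callable, Dict


def poll_engines(engines: Dict[str, Callable[[], bool]]) -> Dict[str, bool]:
    """Ping every engine once; an exception counts as a failed ping."""
    status = {}
    for name, ping in engines.items():
        try:
            status[name] = bool(ping())
        except Exception:
            status[name] = False
    return status


def instance_state(status: Dict[str, bool]) -> str:
    """Classify the instance: all engines alive, some alive, or none alive."""
    if all(status.values()):
        return "up"
    if not any(status.values()):
        return "down"  # candidate for being reported as node down
    return "partial"  # several engines, only some of them failed


def failing_ping() -> bool:
    """Stand-in for an engine whose connection is broken."""
    raise ConnectionError("engine unreachable")


status = poll_engines({"mysql": lambda: True, "redis": failing_ping})
state = instance_state(status)
```

A scheduler would call `poll_engines` at a fixed interval; what to do in the "partial" state is exactly the open question raised above.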
Ad hoc queries and data models
• If a feature does not depend on the distributed architecture, it can be added easily.
  Data model of Redis (List, Set, …)
  Document data model and ad hoc queries of MongoDB
• But if it does depend on it, it cannot be supported.
  Atomic queries across multiple keys
  Join
It is important to determine whether a query depends on the distributed mechanism.
github • https://github.com/sunsuk7tp/MyCassandra/
Twitter • @MyCassandraJP • @_MyCassandra # @MyCassandra had already been taken!! • @sunsuk7tp # my private account
Google Groups • https://groups.google.com/group/my-cassandra
Thank you !