Cassandra Consistency


Quick Overview

Token/DHT

Consistent Hashing

Replication Factor(RF)

Consistency Level(CL)

Hinted Handoff(HH)

A hint is written to the coordinator node when a replica is down

Read Repair(RR)

Background digest query on-read to find and update out-of-date replicas*

* carried out in the background unless CL:ALL

http://www.planetcassandra.org/data-replication-in-nosql-databases-explained/#

Updates (insert, update, delete)


https://uberdev.wordpress.com/2015/11/29/cassandra-developer-certification-study-notes-read-path/


Write Path

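SSTables are immutable: once a Memtable has been flushed to disk, the resulting SSTable cannot be written to again. A single partition may span multiple SSTables, but it cannot span multiple nodes.

Partition/Primary Index: the partition keys and the start position of each row in the Data File (metadata about the data: an index).
Partition/Index Summary: a sample of the Partition Index, kept in memory (metadata about the metadata: an index of the index).
Bloom Filter: checks whether a row (partition key) is in the SSTable; if it is not, the SSTable is not read.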

http://docs.datastax.com/en/cassandra/2.2/cassandra/dml/dmlHowDataWritten.html


Read Path

[Figure: read path diagram, steps ① to ⑦. Recoverable labels: Memtable; RowCache with hit (Y) and miss (N) branches; "a pk is found in key cache".]

https://docs.datastax.com/en/cassandra/3.x/cassandra/dml/dmlAboutReads.html
http://www.datastax.com/dev/blog/maximizing-cache-benefit-with-cassandra

Read Request Flow

Row cache & Key cache

The row cache is not write-through. If a write comes in for the row, the cache for that row is invalidated and is not cached again until the row is read. Similarly, if a partition is updated, the entire partition is evicted from the cache. When the desired partition data is not found in the row cache, then the Bloom filter is checked.

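The row cache is not write-through: when a row is updated, its entry in the row cache is invalidated and removed, and the row is not cached again until the next time it is read.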
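The invalidate-on-write behaviour described above can be illustrated with a toy cache. This is only a sketch: the `RowCache` class, its methods and the dictionary standing in for the memtable and SSTables are hypothetical names for illustration, not Cassandra's implementation.

```python
class RowCache:
    """Toy read-through cache that invalidates on write (it is not write-through)."""

    def __init__(self, store):
        self.store = store   # stands in for the memtable + SSTables
        self.cache = {}      # partition key -> cached row

    def write(self, key, row):
        self.store[key] = row
        # Invalidate instead of updating: the row is cached again only on the next read.
        self.cache.pop(key, None)

    def read(self, key):
        if key in self.cache:
            return self.cache[key], "hit"
        row = self.store[key]  # slow path (Bloom filter, key cache, disk)
        self.cache[key] = row
        return row, "miss"


rc = RowCache({})
rc.write("k", {"v": 1})
first = rc.read("k")    # not cached yet
second = rc.read("k")   # served from the cache
rc.write("k", {"v": 2})
third = rc.read("k")    # the write evicted the entry, so this misses again
```

The read after the second write misses, which is the cost of not being write-through: hot rows that are updated often gain little from the row cache.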

A Bloom filter can establish that an SSTable does not contain certain partition data. A Bloom filter can also find the likelihood that partition data is stored in an SSTable. However, because the Bloom filter is a probabilistic function, it can result in false positives: not all SSTables identified by the Bloom filter will have data. If the Bloom filter does not rule out an SSTable, Cassandra checks the partition key cache.

The partition key cache stores a cache of the partition index off-heap. If a partition key is found in the key cache, the read can go directly to the compression offset map to find the compressed block on disk that has the data.


https://2012.nosql-matters.org/cgn/wp-content/uploads/2012/06/Sylvain_Lebresne-Cassandra_Storage_Engine.pdf

Write & Read Example


Compaction

SSTable Storage Format

Storage

http://distributeddatastore.blogspot.com/2013/08/cassandra-sstable-storage-format.html

Index.db

Data.db

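The index file stores every key (no sampling), and in the MD5 table the keys and values are of uniform size, so the index file and the data file are about the same size.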

Regular Column Tombstone Column


Full Index & Sample Index

Index.db, Summary.db

1. Row key length (short/2 bytes)
2. Key (N bytes)
3. Offset in SSTable data file (long/8 bytes)
4. Promoted size (int/4 bytes)

00000000  00 04 72 6f 77 41 00 00  00 00 00 00 00 00 00 00  |..rowA..........|
00000010  00 00 00 04 72 6f 77 42  00 00 00 00 00 00 00 5f  |....rowB......._|
00000020  00 00 00 00 00 0a 72 6f  77 45 78 63 6c 75 64 65  |......rowExclude|
00000030  00 00 00 00 00 00 00 be  00 00 00 00              |............|
0000003c
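To make the entry layout concrete, here is a small sketch that decodes the 60 bytes of the hex dump using the four fields listed above. It assumes big-endian encoding (which is what the dump suggests) and ASCII keys; `parse_index` and `INDEX_BYTES` are names invented for this example.

```python
import struct

# The 60 bytes shown in the hex dump, grouped per field for readability.
INDEX_BYTES = bytes.fromhex(
    "0004" "726f7741" "0000000000000000" "00000000"
    "0004" "726f7742" "000000000000005f" "00000000"
    "000a" "726f774578636c756465" "00000000000000be" "00000000"
)


def parse_index(buf):
    """Decode (key, data file offset, promoted size) entries from an Index.db buffer."""
    entries = []
    pos = 0
    while pos < len(buf):
        (key_len,) = struct.unpack_from(">H", buf, pos)   # 2-byte key length
        pos += 2
        key = buf[pos:pos + key_len].decode("ascii")      # N-byte key
        pos += key_len
        offset, promoted_size = struct.unpack_from(">qi", buf, pos)  # 8 + 4 bytes
        pos += 12
        entries.append((key, offset, promoted_size))
    return entries


entries = parse_index(INDEX_BYTES)
for key, offset, promoted_size in entries:
    print(key, offset, promoted_size)
```

The decoded offsets (0, 0x5f and 0xbe) are the positions of rowA, rowB and rowExclude in the data file, and every entry has a promoted size of 0.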

Failure, Error Handling

http://www.datastax.com/dev/blog/cassandra-error-handling-done-right


http://www.datastax.com/dev/blog/how-cassandra-deals-with-replica-failure

When a timeout is not a failure


Rapid Read Protection(speculative_retry/dynamic snitch)

https://docs.datastax.com/en/cassandra/3.x/cassandra/dml/dmlClientRequestsRead.html
http://www.planetcassandra.org/blog/rapid-read-protection-in-cassandra-202/
https://issues.apache.org/jira/browse/CASSANDRA-5932

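1. The client requests data from the coordinator node; the coordinator routes the request to the best-performing node (replica) and finally returns the result to the client.

This applies to reads only. A read requests the full data from just one replica; then, depending on the consistency level and the read repair chance, it requests only a checksum (a digest, not the data) from the other replicas. Choosing the most suitable replica is therefore important. The dynamic snitch monitors the read performance of the different replicas and picks the best one based on history.

ALTER TABLE users WITH speculative_retry = '10ms';
ALTER TABLE users WITH speculative_retry = '99percentile';

Pros: lower read latency when some nodes perform poorly.
Cons: extra requests are generated, so throughput drops.

Notes:
1) It does not apply at consistency level ALL, because that level already has to read every replica.
2) In small clusters, rapid read protection also reduces throughput; in larger clusters the effect is not noticeable.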

Recovering from replica node failure with rapid read protection


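2. If the node the request was routed to fails before it returns a response to the coordinator, the client's request eventually times out.

3. Rapid read protection: the coordinator is allowed to monitor outstanding requests. When the read from the original replica is responding more slowly than expected, the coordinator sends an additional request to a node holding another replica.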
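As a rough illustration of "slower than expected", the sketch below turns the two setting formats used in these notes ('10ms' and '99percentile') into a latency threshold and decides whether an extra request would be sent. The function names, the nearest-rank percentile and the latency samples are assumptions made for this example; Cassandra tracks latencies internally and its exact calculation may differ.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of latencies (ms)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def should_speculate(elapsed_ms, setting, recent_latencies_ms):
    """True if a read that has waited elapsed_ms should be retried on another replica."""
    if setting.endswith("percentile"):
        pct = float(setting[: -len("percentile")])
        threshold = percentile(recent_latencies_ms, pct)
    elif setting.endswith("ms"):
        threshold = float(setting[:-2])
    else:
        raise ValueError(f"unsupported setting: {setting}")
    return elapsed_ms > threshold


# Made-up latency history with one slow outlier.
history = [5, 6, 7, 8, 9, 10, 11, 12, 13, 50]
fixed = should_speculate(20, "10ms", history)           # 20 ms > 10 ms
p99 = should_speculate(20, "99percentile", history)     # threshold is the 50 ms outlier
p90 = should_speculate(20, "90percentile", history)     # threshold is 13 ms
```

A lower percentile speculates more often, which protects more reads at the price of more duplicate requests.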


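Nothing is absolute: never enabling speculative execution is bad, but always enabling it is not a good idea either. Enable speculative execution for only 90% of requests, so that only 10% of requests go unprotected.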

Data Consistency

Paxos consensus protocol, Lightweight Transaction (CAS), two-phase commit

https://docs.datastax.com/en/cassandra/3.x/cassandra/dml/dmlAboutDataConsistency.html

Linearizable consistency

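Tunable Consistency:
R: the consistency level of read operations (the number of replicas that must respond to a read)
W: the consistency level of write operations (the number of replicas that must acknowledge a write)
N: the number of replicas

Strong consistency is guaranteed when R + W > N
Eventual consistency occurs when R + W <= N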
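The R + W > N rule is easy to check for common combinations. A minimal sketch, assuming a single data center with N = 3 replicas; the helper names are made up for this example.

```python
def quorum(n):
    """Number of replicas in a quorum of n replicas."""
    return n // 2 + 1


def is_strongly_consistent(r, w, n):
    """True when every read set must overlap every write set (R + W > N)."""
    return r + w > n


N = 3
combos = {
    "W=QUORUM, R=QUORUM": is_strongly_consistent(quorum(N), quorum(N), N),  # 2 + 2 > 3
    "W=ALL, R=ONE": is_strongly_consistent(1, N, N),                        # 1 + 3 > 3
    "W=ONE, R=ONE": is_strongly_consistent(1, 1, N),                        # 1 + 1 <= 3
}
for name, strong in combos.items():
    print(name, "strong" if strong else "eventual")
```

When R + W > N, at least one replica contacted by a read has acknowledged the latest successful write, which is why the read can see it.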


Client read or write requests can go to any node in the cluster because all nodes in Cassandra are peers. When a client connects to a node and issues a read or write request, that node serves as the coordinator for that particular client operation.

The job of the coordinator is to act as a proxy between the client application and the nodes (or replicas) that own the data being requested. The coordinator determines which nodes in the ring should get the request based on the cluster's configured partitioner and replica placement strategy.

https://www.datadoghq.com/blog/how-to-monitor-cassandra-performance-metrics/

Coordinator


Consistency refers to how up-to-date and synchronized a row of Cassandra data is on all of its replicas. Using repair operations, Cassandra data will eventually be consistent in all replicas. Repairs work to decrease the variability in replica data, but at a given time, stale data can be present.

The consistency level determines the number of replicas that need to acknowledge the read or write operation success to the client application. For read operations, the read consistency level specifies how many replicas must respond to a read request before returning data to the client application. For write operations, the write consistency level specifies how many replicas must respond to a write request before the write is considered successful.

Even at low consistency levels, Cassandra writes to all replicas of the partition key, including replicas in other data centers. The write consistency level just specifies when the coordinator can report to the client application that the write operation is considered completed.

If a read operation reveals inconsistency among replicas, Cassandra initiates a read repair to update the inconsistent data. Write operations will use hinted handoffs to ensure the writes are completed when replicas are down or otherwise not responsive to the write request.

Typically, a client specifies a consistency level that is less than the replication factor specified by the keyspace. Another common practice is to write at a consistency level of QUORUM and read at a consistency level of QUORUM. The choices made depend on the client application's needs, and Cassandra provides maximum flexibility for application design. There is a tradeoff between operation latency and consistency: higher consistency incurs higher latency, lower consistency permits lower latency. You can control latency by tuning consistency.

Consistency Level (CL): How many replicas must respond to declare success?
Hinted Handoff (HH): A hint is written to the coordinator node when a replica is down
Read Repair (RR): Background digest query on-read to find and update out-of-date replicas

https://docs.datastax.com/en/cassandra/2.2/cassandra/dml/dmlAboutDataConsistency.html

Consistency Level


[Figure: read request flow. Labels: Client; Direct Read (two arrows); Digest Read; "Compare in memory, decide which is latest".]

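What if n4 is newer than n3: does the coordinator issue another direct read to n4? (n4 returned only a digest, so a direct read is needed to get the full data.) In this situation, n3 will also pull the newer data from n4.❓

Although the replicas are stored on n2, n3 and n4, and n2 can be regarded as the primary replica, the coordinator uses historical data to pick the replica on the fastest node.

CL=ONE?

Read the data from the least-loaded node (what if it is not the latest?). Is the comparison pairwise, or between the direct read and the digest reads?

At CL=ONE the read_repair_chance setting takes effect: only 10% of requests perform a read repair. The chance has no effect for CL>ONE: at CL=QUORUM/ALL, every request that finds an inconsistency triggers a repair.

read_repair_chance is ignored if the ConsistencyLevel is greater than ONE and read repair always occurs.

Write=ALL, Read=ONE guarantees strong consistency, and only 10% of requests start a background read repair.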

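Read repair means that when a query is made against a given key, we perform a digest query against all the replicas of the key and push the most recent version to any out-of-date replicas. If a lower ConsistencyLevel than ALL was specified, this is done in the background after returning the data from the closest replica to the client; otherwise (CL=ALL), it is done before returning the data. This means that in almost all cases, at most the first instance of a query will return old data (the first query may receive stale data, but later identical queries return fresh data because the replicas have been repaired).

Read repair mechanism: the query first asks the closest node for the data [1], then sends digest requests to the other nodes. After all replicas have been compared, the data with the latest timestamp is pushed to the out-of-date replicas. The consistency levels differ only in when read repair happens: at ONE or QUORUM, read repair starts in the background after the closest node's data [1] has been returned to the client; at ALL, read repair completes before the data is returned to the client. Whatever the consistency level, the full data is requested only from the closest node. Even if that node's data is not the latest, it is still what is returned to the client, so stale data may be returned.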

https://wiki.apache.org/cassandra/ReadRepair
https://docs.datastax.com/en/cassandra/2.2/cassandra/dml/dmlClientRequestsRead.html

http://www.datastax.com/dev/blog/common-mistakes-and-misconceptions

There are three types of read requests that a coordinator can send to a replica:
+ A direct read request
+ A digest request
+ A background read repair request

The coordinator node contacts one replica node with a direct read request. Then the coordinator sends a digest request to a number of replicas determined by the consistency level specified by the client. The digest request checks the data in the replica node to make sure it is up to date. Then the coordinator sends a digest request to all remaining replicas. If any replica nodes have out of date data, a background read repair request is sent. Read repair requests ensure that the requested row is made consistent on all replicas.

For a digest request the coordinator first contacts the replicas specified by the consistency level. The coordinator sends these requests to the replicas that are currently responding the fastest. The nodes contacted respond with a digest of the requested data; if multiple nodes are contacted, the rows from each replica are compared in memory to see if they are consistent. If they are not, then the replica that has the most recent data (based on the timestamp) is used by the coordinator to forward the result back to the client. To ensure that all replicas have the most recent version of the data, read repair is carried out to update out-of-date replicas.

CL=ONE: direct read from one node; only 10% of requests trigger a background read repair (of the remaining two replicas).
CL=QUORUM: direct read from one node and a digest read to one other node, which satisfies QUORUM. Once these two nodes are confirmed to be consistent, the direct-read data is returned to the client, and a digest read is then sent to the last node (what if that last node holds the newest data?).
CL=ALL: direct read from one node and digest reads to the other two nodes; read repair runs to make all nodes consistent, then the direct-read data is returned to the client.

Read & Read Repair

Read repair is not directly related to repair, but both play a role in the overall anti-entropy system in Cassandra. The read_repair_chance setting used to start out as 1. That is, at a consistency level of ONE, every read would check whether the data just read was consistent with the other replicas. This was good, because if you ever read stale data, the next time you read the same row you would probably read something more up to date. The downside was that every read became RF reads (and RF is typically set to at least 3), so reads happened more often and required more IO. In newer versions of Cassandra the default for this value is 0.1, and it is set on a per-column-family basis, which means 10% of your requests will trigger a background read repair. This is more than enough for typical scenarios.


When data is read to satisfy a query and return a result, all replicas are queried for the data needed. The first replica node receives a direct read request and supplies the full data. The other nodes contacted receive a digest request and return a digest, or hash of the data. A digest is requested because generally the hash is smaller than the data itself.

A comparison of the digests allows the coordinator to return the most up-to-date data to the query. (Open questions: can a digest be returned directly to the client? What if the direct read is not the latest? Can a digest be compared with the direct read?) If the digests are the same for enough replicas to meet the consistency level, the data is returned. If the consistency level of the read query is ALL, the comparison must be completed before the results are returned; otherwise for all lower consistency levels, it is done in the background.

The coordinator compares the digests, and if a mismatch is discovered, a request for the full data is sent to the mismatched nodes. (Question: is the full data used here the direct-read data, or the version with the latest timestamp among the digest replicas?) The most current data found in a full data comparison is used to reconcile any inconsistent data on other replicas.
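The direct read, digest comparison and reconciliation steps quoted above can be sketched as follows. This is a simplified model, not Cassandra's code: replicas are plain dictionaries, a row is a (value, timestamp) tuple, the digest is a SHA-256 hash of that tuple, and `coordinator_read` is a hypothetical name.

```python
import hashlib


def digest(row):
    """Hash of a (value, timestamp) row, standing in for a digest response."""
    return hashlib.sha256(repr(row).encode()).hexdigest()


def coordinator_read(replicas, key, contacted):
    """Read key from the contacted replicas; the first one gets the direct read.

    Returns (row returned to the client, names of replicas that were repaired).
    """
    direct, others = contacted[0], contacted[1:]
    data = replicas[direct][key]                                  # direct read: full data
    digests = [digest(replicas[name][key]) for name in others]    # digest reads
    if all(d == digest(data) for d in digests):
        return data, []                                           # digests agree
    # Mismatch: request full data from the contacted replicas, newest timestamp wins.
    full = {name: replicas[name][key] for name in contacted}
    newest = max(full.values(), key=lambda row: row[1])
    repaired = [name for name, row in full.items() if row != newest]
    for name in repaired:
        replicas[name][key] = newest                              # read repair write-back
    return newest, repaired


replicas = {
    "n2": {"k": ("old", 1)},
    "n3": {"k": ("new", 2)},
    "n4": {"k": ("new", 2)},
}
first = coordinator_read(replicas, "k", ["n2", "n3"])   # n2 is stale
second = coordinator_read(replicas, "k", ["n2", "n3"])  # n2 has been repaired
```

Under this reading of the quoted text, a stale direct read is not returned as-is when a digest mismatch is detected among the contacted replicas: the full data is fetched and the version with the newest timestamp is used.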

http://docs.datastax.com/en/cassandra/2.2/cassandra/operations/opsRepairNodesTOC.html
http://docs.datastax.com/en/cassandra/2.2/cassandra/operations/opsRepairNodesReadRepair.html

Node repair makes data on a replica consistent with data on other nodes and is important for every Cassandra cluster. Repair is the process of correcting the inconsistencies so that eventually, all nodes have the same and most up-to-date data.

Repair can occur in the following ways:

✅ Hinted Handoff: During the write path, if a node that should receive data is unavailable, hints are written to the coordinator. When the node comes back online, the coordinator can hand off the hints so that the node can catch up and write the data.

✅ Read Repair: During the read path, a query acquires data from several nodes. The acquired data from each node is checked against each other node. If a node has outdated data, the most recent data is written back to the node.

✅ Anti-Entropy Repair: For maintenance purposes or recovery, manually run anti-entropy repair (with nodetool repair) to rectify inconsistencies on any nodes.

Repair


Hint TTL: max_hint_window_in_ms = 3 hours. If a node stays down for more than 3 hours, no further hints are stored for it.
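The hint window rule amounts to a simple time comparison. A minimal sketch, assuming millisecond timestamps; `should_store_hint` is a made-up helper, and only the 3-hour value comes from the note above.

```python
MAX_HINT_WINDOW_MS = 3 * 60 * 60 * 1000  # 3 hours, as in the note above


def should_store_hint(now_ms, replica_down_since_ms, window_ms=MAX_HINT_WINDOW_MS):
    """True while the replica has been down for no longer than the hint window."""
    return now_ms - replica_down_since_ms <= window_ms


HOUR_MS = 60 * 60 * 1000
within_window = should_store_hint(now_ms=1 * HOUR_MS, replica_down_since_ms=0)
past_window = should_store_hint(now_ms=4 * HOUR_MS, replica_down_since_ms=0)
```

Writes that arrive after the window has passed leave no hint, so a node that was down for longer has to be brought up to date by read repair or anti-entropy repair instead.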

Tunable Consistency

Low latency, low consistency: low latency is only possible with lower consistency.
High latency, high consistency: higher consistency incurs higher latency.

Read, Write

Consistency Example

https://docs.datastax.com/en/cassandra/3.x/cassandra/dml/dmlClientRequestsWrite.html

The coordinator sends a write request to all replicas that own the row being written. As long as all replica nodes are up and available, they will get the write regardless of the consistency level specified by the client. The write consistency level determines how many replica nodes must respond with a success acknowledgment in order for the write to be considered successful. Success means that the data was written to the commit log and the memtable as described in how data is written.

In a single data center 12 node cluster with a replication factor of 3, an incoming write will go to all 3 nodes that own the requested row. If the write consistency level specified by the client is ONE, the first node [R1] to complete the write responds back to the coordinator, which then proxies the success message back to the client [write response]. A consistency level of ONE means that it is possible that 2 of the 3 replicas [R2,R3] could miss the write if they happened to be down at the time the request was made.

That node [coordinator] forwards the write to all replicas of that row. It responds to the client once it receives write acknowledgments from the number of nodes specified by the consistency level.
1. If the coordinator cannot write to enough replicas to meet the requested CL, it throws an Unavailable Exception and does not perform any writes.
2. If there are enough replicas available but the required writes don't finish within the timeout window, the coordinator throws a Timeout Exception.
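The two failure cases above differ in whether the write is attempted at all. The sketch below models that distinction; the replica states ('up', 'slow', 'down'), the exception classes and the function names are all illustrative assumptions, not the driver's or the server's API.

```python
class Unavailable(Exception):
    """Too few live replicas to even attempt the write."""


class WriteTimeout(Exception):
    """Enough live replicas, but too few acknowledged in time."""


def coordinate_write(replica_states, required_acks):
    """replica_states maps replica name -> 'up', 'slow' or 'down'.

    'down' replicas are known to be dead before the write starts;
    'slow' replicas are alive but acknowledge after the timeout window.
    """
    alive = [name for name, state in replica_states.items() if state != "down"]
    if len(alive) < required_acks:
        raise Unavailable(f"need {required_acks}, only {len(alive)} alive")  # nothing written
    # The write is sent to every live replica, regardless of the consistency level.
    acks = [name for name in alive if replica_states[name] == "up"]
    if len(acks) < required_acks:
        raise WriteTimeout(f"need {required_acks}, got {len(acks)} acks in time")
    return acks


def outcome(replica_states, required_acks):
    """Return the acknowledging replicas, or the name of the exception raised."""
    try:
        return coordinate_write(replica_states, required_acks)
    except (Unavailable, WriteTimeout) as exc:
        return type(exc).__name__


one_ok = outcome({"r1": "up", "r2": "down", "r3": "down"}, 1)
quorum_unavailable = outcome({"r1": "up", "r2": "down", "r3": "down"}, 2)
quorum_timeout = outcome({"r1": "up", "r2": "slow", "r3": "slow"}, 2)
```

A timeout is therefore weaker evidence than an unavailable error: in the timeout case the write was sent to the live replicas and may still be applied on them.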

Write Consistency


DC:2, RF:3, CL:QUORUM => all data centers, two replicas

In multiple data center deployments, Cassandra optimizes write performance by choosing one coordinator node. The coordinator node contacted by the client application forwards the write request to each replica node in all the data centers.

If using a consistency level of LOCAL_ONE or LOCAL_QUORUM, only the nodes in the same data center as the coordinator node must respond to the client request in order for the request to succeed. This way, geographical latency does not impact client request response times.

https://docs.datastax.com/en/cassandra/3.x/cassandra/dml/dmlClientRequestsReadExp.html

DC:1, RF:3, CL:QUORUM=>2

In a single data center cluster with a replication factor of 3, and a read consistency level of QUORUM, 2 of the 3 replicas for the given row must respond to fulfill the read request. If the contacted replicas have different versions of the row, the replica with the most recent version will return the requested data [to Client]. In the background, the third replica is checked for consistency with the first two, and if needed, a read repair is initiated for the out-of-date replicas.

Read Consistency


DC:1, RF:3, CL:ONE=>1

In a single data center cluster with a replication factor of 3, and a read consistency level of ONE, the closest replica for the given row is contacted to fulfill the read request. In the background a read repair is potentially initiated, based on the read_repair_chance setting of the table, for the other replicas.

In a two data center cluster with a RF=3, and a read consistency of QUORUM, 4 replicas for the given row must respond to fulfill the read request. The 4 replicas can be from any data center. In the background, the remaining replicas are checked for consistency with the first four, and if needed, a read repair is initiated for the out-of-date replicas.

DC:2, RF:3, CL:QUORUM => any data center, four replicas

DC:2, RF:3, CL:LOCAL_QUORUM => local data center, two replicas

In a multiple data center cluster with a RF=3, and a read consistency of LOCAL_QUORUM, 2 replicas in the same DC as the coordinator node for the given row must respond to fulfill the read request. In the background, the remaining replicas are checked for consistency with the first 2, and if needed, a read repair is initiated for the out-of-date replicas.
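The replica counts in these examples follow from the usual quorum arithmetic: a quorum is more than half of the replicas being counted. A small sketch, with data center names and helper functions invented for the example:

```python
def quorum(replication_factors):
    """QUORUM counts replicas across all data centers: (sum of RFs) // 2 + 1."""
    return sum(replication_factors.values()) // 2 + 1


def local_quorum(replication_factors, local_dc):
    """LOCAL_QUORUM counts only the replicas in the coordinator's data center."""
    return replication_factors[local_dc] // 2 + 1


one_dc = {"dc1": 3}
two_dcs = {"dc1": 3, "dc2": 3}

single_dc_quorum = quorum(one_dc)                # 3 replicas in total
two_dc_quorum = quorum(two_dcs)                  # 6 replicas in total
two_dc_local = local_quorum(two_dcs, "dc1")      # 3 replicas in dc1
print(single_dc_quorum, two_dc_quorum, two_dc_local)
```

With two data centers, QUORUM needs four of the six replicas, so at least one response must come from the remote data center, while LOCAL_QUORUM needs only two replicas from the coordinator's own data center.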

DC:2, RF:3, CL:ONE => any DC, one replica

In a multiple data center cluster with a RF=3, and a read consistency of ONE, the closest replica for the given row, regardless of data center, is contacted to fulfill the read request. In the background a read repair is potentially initiated, based on the read_repair_chance setting of the table, for the other replicas.

DC:2, RF:3, CL:LOCAL_ONE => local data center, one replica

In a multiple data center cluster with a RF=3, and a read consistency of LOCAL_ONE, the closest replica for the given row in the same data center as the coordinator node is contacted to fulfill the read request. In the background a read repair is potentially initiated, based on the read_repair_chance setting of the table, for the other replicas.

Bloom Filter

[Figure: Bloom filter lookup sequence. A query for key1 asks each SSTable's Bloom filter "Am I here?". When a filter answers "No, you're NOT here!", the read trusts it ("OK, I believe you!") and goes on to the next SSTable.]

bloom_filter_fp_chance (false positive chance)

Determines the percent chance of the Bloom filter returning a false positive, that is, reporting that a partition exists in an SSTable when in fact it does not.

False positives are possible; false negatives are not.

If you increase the percent chance of false positives, then you lower memory usage via a smaller filter size at the expense of more disk seeks due to an increase in false positives.

If you decrease the percent chance of false positives, then you increase memory usage via a larger filter size for the benefit of fewer disk seeks thanks to fewer false positives.
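The memory side of this trade-off can be estimated with the textbook Bloom filter sizing formula, bits per key = -ln(p) / (ln 2)^2 for a target false positive chance p. This is the standard result for an optimally configured filter, used here only as an approximation; Cassandra's actual sizing may differ, and the function name is hypothetical.

```python
import math


def bloom_bits_per_key(fp_chance):
    """Bits per key needed by an optimally sized Bloom filter for fp_chance."""
    return -math.log(fp_chance) / (math.log(2) ** 2)


for fp_chance in (0.1, 0.01, 0.001):
    print(fp_chance, round(bloom_bits_per_key(fp_chance), 1))
```

Under this formula each tenfold reduction of the false positive chance costs roughly 4.8 extra bits per key, so going from 0.1 to 0.01 about doubles the filter size.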

https://grockdoc.com/cassandra/2.1/articles/tuning-reads-via-the-bloom-filter_88c8f57a-71d0-41ee-b77f-617c64ad4739/
http://docs.datastax.com/en/cql/3.1/cql/cql_reference/compactSubprop.html

False positive matches are possible, but false negatives are not. In other words, a query returns either “possibly in set” or “definitely not in set”.
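A minimal Bloom filter shows why the answer is either "possibly in set" or "definitely not in set". This is an illustrative sketch only: the class, its default sizes and the SHA-256 based hashing are choices made for the example and are not how Cassandra implements its filters.

```python
import hashlib


class BloomFilter:
    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for clarity

    def _positions(self, key):
        # Derive num_hashes positions by salting the key with the hash index.
        for i in range(self.num_hashes):
            h = hashlib.sha256(f"{i}:{key}".encode()).digest()
            yield int.from_bytes(h[:8], "big") % self.num_bits

    def add(self, key):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key):
        # False: definitely not in the set. True: possibly in the set.
        return all(self.bits[pos] for pos in self._positions(key))


empty = BloomFilter()
filled = BloomFilter()
keys = [f"key{i}" for i in range(100)]
for k in keys:
    filled.add(k)

no_false_negatives = all(filled.might_contain(k) for k in keys)
empty_says_no = not empty.might_contain("key1")
```

Every added key sets all of its bit positions, so a lookup for it can never return False. A key that was never added returns True only if other keys happen to have set all of its positions, which is the false positive case.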


http://www.datastax.com/dev/blog/improving-compaction-in-cassandra-with-cardinality-estimation


Merkle Tree

https://docs.datastax.com/en/cassandra/3.0/cassandra/operations/opsRepairNodesManualRepair.html


http://www.datastax.com/dev/blog/more-efficient-repairs


Java Driver

http://christopher-batey.blogspot.com/2015/02/cassandra-anti-pattern-misuse-of.html


https://www.pythian.com/blog/guide-to-cassandra-thread-pools/

