
Cassandra Basics

CAP:

C: Consistency - all nodes see the same data at the same time.
A: Availability - every client request receives a response, whether it succeeds or fails.
P: Partition tolerance - the loss or failure of any part of the system does not stop the system from continuing to operate.

CA - satisfies consistency and availability; usually not very scalable, e.g. a single-site cluster. CP - satisfies consistency and partition tolerance; usually performance is not especially high. AP - satisfies availability and partition tolerance; usually has weaker consistency requirements.

Hash->DHT->VNodes

1. When the key "tokyo" is passed to the client library, the algorithm implemented on the client side decides which server stores the data based on the key. Once a server is selected, it is told to store "tokyo" and its value. 2. To retrieve the stored data, the key "tokyo" is again passed to the library. The library uses the same algorithm as at store time to pick a server from the key; because the algorithm is the same, it selects the same server as before and sends it a get command. As long as the data has not been deleted for some reason, the stored value is returned.
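A minimal sketch of that store/get flow, assuming a hypothetical fixed server list and Java's hashCode as the hash function (not the memcached client's real API):

import java.util.Arrays;
import java.util.List;

public class ModuloDistribution {
    private final List<String> servers;

    public ModuloDistribution(List<String> servers) {
        this.servers = servers;
    }

    // set and get both call this, so the same key always maps to the same server
    public String serverFor(String key) {
        int index = Math.floorMod(key.hashCode(), servers.size());  // simple modulo placement
        return servers.get(index);
    }

    public static void main(String[] args) {
        ModuloDistribution d = new ModuloDistribution(Arrays.asList("node1", "node2", "node3"));
        System.out.println("\"tokyo\" is stored on " + d.serverFor("tokyo"));
    }
}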

memcached in depth: http://charlee.li/memcached-004.html

Hash distribution

The key space, 0 to 2^32, forms a ring called the hash space ring (the hash value space). Hashing each server in the cluster (for example by IP address) fixes its position on the ring. To locate the server for a piece of data: compute the hash h of the data's key with the same function H, which determines its position on the ring; then, starting from that position, walk clockwise around the ring, and the first server encountered is the one the data belongs to.

1. Compute the hash of each server (node) and place it on the circle of 0 to 2^32. Then compute the hash of each data key with the same method and map it onto the circle as well. 2. Starting from the position the data maps to, search clockwise and store the data on the first server found. If no server is found after passing 2^32, store it on the first server.
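A minimal sketch of this clockwise walk, assuming hypothetical node names and Java's hashCode standing in for a real ring hash such as MD5 or Murmur3:

import java.util.Map;
import java.util.TreeMap;

public class HashRing {
    // token position on the ring -> node, kept sorted so we can walk clockwise
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    public void addNode(String node) {
        ring.put(hash(node), node);
    }

    public String nodeFor(String key) {
        Map.Entry<Integer, String> e = ring.ceilingEntry(hash(key));      // first node at or after the key
        return e != null ? e.getValue() : ring.firstEntry().getValue();   // wrap around past the end of the ring
    }

    private int hash(String s) {
        return s.hashCode();  // placeholder hash function
    }
}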

Consistent Hashing (DHT)

Consistent hashing minimizes the redistribution of keys. However, with a general-purpose hash function the servers' positions on the ring are distributed very unevenly. The improvement is to use virtual nodes: assign each physical node 100 to 200 points on the ring. This smooths out the distribution and minimizes cache redistribution when servers are added or removed.

Consistent Hashing (adding a node): when a server is added, the modulo-based algorithm changes which server holds almost every key, which hurts the cache hit rate; with consistent hashing, only the keys on the first server counter-clockwise from the point where the new server is inserted are affected.

Consistent Hashing (example)

Keys such as jim, johnny, suzy, and carol are hashed by their partition key and mapped onto the ring.

http://docs.basho.com/riak/kv/2.1.4/learn/concepts/clusters/

Because consistent hashing skews the data when there are too few service nodes (the nodes end up spread unevenly around the ring), virtual nodes are introduced: each of the n servers is split into v virtual nodes, and all n*v virtual nodes are placed randomly on the consistent-hashing ring. The first vnode reached by walking clockwise from the key's position on the ring is the node the key belongs to.
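Extending the ring sketch above to virtual nodes is a small change: each physical node is hashed v times under derived names so its vnodes scatter around the ring (the "#i" suffix is illustrative only):

public void addNode(String node, int vnodes) {
    for (int i = 0; i < vnodes; i++) {
        ring.put(hash(node + "#" + i), node);  // one ring position per virtual node
    }
}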

Consistent Hashing (vnodes)

A ring with 32 partitions, a cluster of 4 nodes, 8 vnodes per node.

A key is hashed to a position on the ring; the next vnode clockwise is the first node that stores the data, and the following two vnodes hold the other two replicas.

vnodes & replicas

Replica 1, Replica 2, Replica 3

http://www.littleriakbook.com/

5 nodes and 64 partitions in total, so roughly 12.8 vnodes per node.

www.tom-e-white.com/2007/11/consistent-hashing.html


http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2

One token range per node … => VNodes: many small, non-contiguous token ranges per node.

Each piece of data has three replicas.


16 partitions, 3 replicas, 6 nodes: each node holds 16*3/6 = 8 vnodes.

Single Token & Virtual Tokens

Token Range

A greater number of smaller ranges rebuilds a replacement node faster than a single token per node: for the same amount of data, many small ranges stream faster than a few large ones (copying from three nodes is slower than copying from five).

Machines of different capability can be assigned different numbers of vnodes, so the stronger machines take on more of the load.

VNodes (dynamic number of virtual nodes)

Random distribution, range transfer: each node's contiguous ranges are shuffled across the ring.

http://www.datastax.com/dev/blog/upgrading-an-existing-cluster-to-vnodes-2

uneven data distribution

even data distribution

http://docs.datastax.com/en/archived/cassandra/1.1/docs/cluster_architecture/partitioning.html#

Ring: the data range formed by all the nodes in the cluster. Token: each node is assigned one or more tokens on the ring. Token Range: the range of values (previous token, current token]. Walk Clockwise: walk clockwise to the first node that owns the value.


Assign tokens evenly within each data center separately, taking care that token assignments do not overlap.

http://engineeringblog.yelp.com/2016/06/monitoring-cassandra-at-scale.html

Data is mapped onto a ring with virtual nodes: four nodes, 3 virtual nodes each, and a replica count of 3. Suppose the ring has 12 token ranges in total and the three replicas are placed on consecutive ranges in clockwise order. For example, a key that falls in token range 9 (between 8 and 9) is stored on the three nodes [A, B, C] that own tokens [9, 10, 11]. In a healthy state every token range has three replicas; when a node goes down, some token ranges lose replicas.

Token & Replicas Available

Suppose node A goes down: every token range owned by A is affected, and 9 token range replicas are lost in total. Considering vnodes and replicas, each node has three virtual nodes and each piece of data has three replicas, so 3*3 = 9 tokens are involved. For example, a key mapped to token 8 is placed on tokens [8, 9, 10] = nodes [D, A, B]; with A down, that key can no longer be replicated to A. Likewise, keys originally mapped to tokens 7, 8, 9, 11, 12, 1, 3, 4, 5 (the token ranges shown with only 2 available replicas on the inner ring) can no longer be replicated to or stored on node A.

If the consistency level is QUORUM, every token range still has 2 of 3 replicas available, so operations can proceed normally.

Count the three nodes starting from the current one (inclusive); the number of those still up is the number of available replicas: 8->[8,9,10] -> available: [8,10] = 2; 9->[9,10,11] -> available: [10,11] = 2; 10->[10,11,12] -> available: [10,11,12] = 3.

Now suppose node C also goes down: another 6 token ranges lose replicas. With consistency level QUORUM, any key that falls in those 6 token ranges is unavailable, because only 1 of 3 replicas remains, which is less than the 2-of-3 majority.

Count the three nodes starting from the current one (inclusive): 8->[8,9,10] -> available: [8,10] = 2; 9->[9,10,11] -> available: [10] = 1; 10->[10,11,12] -> available: [10,12] = 2.
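A small sketch of this counting, assuming the hypothetical 12-range ring from the figure where tokens 1..12 are owned by nodes A, B, C, D in turn (so token 8 = D, 9 = A, 10 = B, 11 = C) and each key is replicated on three consecutive ranges clockwise:

import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class ReplicaAvailability {
    public static void main(String[] args) {
        List<String> owner = Arrays.asList("A", "B", "C", "D", "A", "B", "C", "D", "A", "B", "C", "D");
        Set<String> down = new HashSet<>(Arrays.asList("A"));  // try adding "C" as well

        for (int t = 0; t < owner.size(); t++) {
            int available = 0;
            for (int r = 0; r < 3; r++) {                       // the three replica ranges for this key
                if (!down.contains(owner.get((t + r) % owner.size()))) available++;
            }
            System.out.println("token " + (t + 1) + ": " + available + "/3 available"
                    + (available >= 2 ? " - QUORUM ok" : " - QUORUM fails"));
        }
    }
}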

Replication Strategy & Replication Factor

http://distributeddatastore.blogspot.com/2015/08/cassandra-replication.html

walking ring clockwise

Replication Strategy: how replicas are chosen (within one data center and across data centers). Replication Factor: the number of replicas (each data center can have a different replication factor).

SimpleStrategy: only suitable for a single data center and a single rack; rack unaware. NetworkTopologyStrategy: multiple data centers; both DC aware and rack aware.
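A hedged sketch of how strategy and factor are chosen when creating a keyspace through the driver session shown later in this deck; keyspace names, DC names, and factors are placeholders:

// single data center, rack unaware
session.execute("CREATE KEYSPACE demo_simple WITH replication = "
        + "{'class': 'SimpleStrategy', 'replication_factor': 3}");

// multiple data centers, each with its own replication factor
session.execute("CREATE KEYSPACE demo_multi_dc WITH replication = "
        + "{'class': 'NetworkTopologyStrategy', 'DC1': 3, 'DC2': 2}");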

Data Partitioner: how a row key is hashed (Random, Murmur3, ByteOrdered); it determines how data (including replicas) is distributed across the nodes of the cluster. The Replication Strategy is the policy for placing/choosing replicas (which nodes hold them), while the Partitioner decides whether data is spread evenly across the cluster; it is essentially the hash algorithm used to generate tokens.

Dynamic Snitch: monitors read latency and avoids routing requests to poorly performing nodes; it is enabled by default on top of every snitch.

NetworkTopologyStrategy

DC1 and DC2 each have 2 replicas. Starting from key1, the first node clockwise is N2 in DC2; DC2 = [N2, N4, N5] with racks [R1, R1, R2]. Because N2 is in RACK1, the next replica cannot also be N4 in RACK1, so N5 in RACK2 is chosen instead; DC1 follows the same strategy.

Replica placement in the figure: DC2 - RACK1 ✅, a second RACK1 node 🙅, RACK2 ✅; DC1 - RACK1 ✅, a second RACK1 node 🙅, RACK2 ✅.

> Local Read: reads do not cross data centers.

If no token is specified for the new node, Cassandra automatically splits the token range of the busiest node in the cluster.

The “busy” node streams half of its data to the new node in the cluster.

When the node finishes bootstrapping, it is available for client requests.

Vnodes simplify many tasks in Cassandra:
• You no longer have to calculate and assign tokens to each node.
• Rebalancing a cluster is no longer necessary when adding or removing nodes. When a node joins the cluster, it assumes responsibility for an even portion of data from the other nodes in the cluster. If a node fails, the load is spread evenly across other nodes in the cluster.
• Rebuilding a dead node is faster because it involves every other node in the cluster and because data is sent to the replacement node incrementally instead of waiting until the end of the validation phase.
• Improves the use of heterogeneous machines in a cluster. You can assign a proportional number of vnodes to smaller and larger machines (see the note after this list).
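In practice the number of vnodes a machine takes on is set per node with the num_tokens option in cassandra.yaml, so a larger machine can simply be given a higher num_tokens value than a smaller one.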

When joining the cluster, a new node receives data from all other nodes.

The cluster is automatically balanced after the new node finishes bootstrapping.

Adding Capacity with or without VNodes

cluster = Cluster.builder()
        .addContactPoints("192.168.50.100", "192.168.50.101")            // initial contact points
        .withLoadBalancingPolicy(new DCAwareRoundRobinPolicy("DC1"))     // prefer nodes in the local DC
        .withRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE)     // retry at a lower CL on failure
        .build();

session = cluster.connect(keyspace);

• Each node handles client requests, but the balancing policy is configurable.
• Round Robin - evenly distributes queries across all nodes in the cluster, regardless of datacenter.
• DC-Aware Round Robin - prefers hosts in the local datacenter and only uses nodes in remote datacenters when local hosts cannot be reached.
• Token-Aware - queries are first sent to local replicas.

Load Balancing - Driver
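A sketch of a token-aware policy using the same driver API as the connection example above; the contact point and DC name are placeholders:

cluster = Cluster.builder()
        .addContactPoints("192.168.50.100")
        .withLoadBalancingPolicy(
                new TokenAwarePolicy(new DCAwareRoundRobinPolicy("DC1")))  // send queries to local replicas first
        .build();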

Retry Policy - Client Driver

A policy that defines a default behavior to adopt when a request returns an exception.

Such a policy centralizes the handling of query retries, minimizing the need for exception catching/handling in business code.

DowngradingConsistencyRetryPolicy - A retry policy that retries a query with a lower consistency level than the one initially requested.

Round Robin (figure: the client spreads requests across local and remote nodes)

DC Aware Round Robin (figures): the client attempts to contact nodes in the local datacenter; remote nodes are used when local nodes cannot be reached.

Writing Data (RF=3)

The client sends a mutation (insert/update/delete) to a node in the cluster. That node serves as the coordinator for this transaction.

The coordinator forwards the update to all replicas.

The replicas acknowledge that data was written.

And the coordinator sends a successful response to the client.

What if a node is down? (RF=3)

Only two nodes respond. The client gets to choose if the write was successful. Write Consistency Level = 2/Quorum.

• ONE Returns data from the nearest replica.

• QUORUM Returns the most recent data from the majority of replicas.

• ALL Returns the most recent data from all replicas.
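A sketch of requesting a consistency level per statement with the same driver API; keyspace, table, and the bound value are placeholders:

Statement stmt = new SimpleStatement("SELECT * FROM demo.users WHERE user_id = ?", userId)
        .setConsistencyLevel(ConsistencyLevel.QUORUM);  // wait for a majority of replicas
session.execute(stmt);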

What if a node is down? (RF=3)

CL = QUORUM. Will this write succeed? YES! A majority of replicas received the mutation.

CL = QUORUM. Will this write succeed? NO. Failed to write a majority of replicas.

The client can still decide how to proceed. (RF=3)

CL = QUORUM, DataStax Driver = DowngradingConsistencyRetryPolicy. Will this write succeed? YES! With consistency downgraded to ONE, the write will succeed.

Multi DC Writes (DC1 RF=3, DC2 RF=3)

The coordinator forwards the mutation to local replicas and a remote coordinator.

The remote coordinator forwards the mutation to replicas in the remote DC.

All replicas acknowledge the write.

Reading Data (RF=3)

The client sends a query to a node in the cluster. That node serves as the coordinator.

The coordinator forwards the query to all replicas.

The replicas respond with data.

And the coordinator returns the data to the client.

What if the nodes disagree? (RF=3)

Data was written with QUORUM when one node was down. The write was successful, but that node missed the update.

Now the node is back online, and it responds to a read request. It has older data than the other replicas.

The coordinator resolves the discrepancy and sends the newest data to the client.

READ REPAIR: the coordinator also notifies the "out of date" node that it has old data. The "out of date" node receives updated data from another replica.

Read Repair Chance (RF=3)

What if I'm only reading from a single node? How will Cassandra know that a node has stale data? Cassandra will occasionally request a hash from other nodes to compare.
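A hedged sketch of tuning this behavior: in pre-4.0 Cassandra the probability is the read_repair_chance table option (the table name and value are placeholders):

session.execute("ALTER TABLE demo.users WITH read_repair_chance = 0.1");  // check extra replicas on roughly 10% of reads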

Hinted Handoff

Hints provide a recovery mechanism for writes targeting offline nodes:
• The coordinator can store a hint if the target node for a write is down or fails to acknowledge.
• The write is replayed when the target node comes online.

What if the hint is enough? (CL=ANY)

If all replica nodes are down, the write can still succeed once a hint has been written. Note that if all replica nodes are down at write time, then an ANY write will not be readable until the replica nodes have recovered.

Rapid Read Protection (RF=3)

During a read, does the coordinator really forward the query to all replicas? That seems unnecessary!

No. Cassandra performs only as many requests as necessary to meet the requested Consistency Level, and routes requests to the most-responsive replicas.

If a replica doesn't respond quickly, Cassandra will try another node. This is known as an "eager retry".
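A hedged sketch of the table option behind eager retries, speculative_retry; the table name is a placeholder:

session.execute("ALTER TABLE demo.users WITH speculative_retry = '99percentile'");  // query another replica past the 99th-percentile latency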
