Cassandra Basics

Upload: zqhxuyuan

Post on 11-Apr-2017


Page 1: Cassandra basic

Cassandra Basics

Page 2: Cassandra basic

Consistency: all nodes see the same data at the same time

Availability: every request receives a response, whether it succeeds or fails

Partition tolerance: the loss or failure of any part of the system does not stop the system from operating

Page 3: Cassandra basic

C: Consistency (all nodes see the same data at the same time). A: Availability (every request, successful or not, receives a response). P: Partition tolerance (the loss or failure of any part of the system does not stop it from operating).

CA - systems satisfying consistency and availability are usually weak on scalability, e.g. single-node clusters. CP - systems satisfying consistency and partition tolerance usually do not perform especially well. AP - systems satisfying availability and partition tolerance usually relax their consistency requirements.

Page 4: Cassandra basic

Hash->DHT->VNodes

Page 5: Cassandra basic

1. After "tokyo" is passed to the client library, the algorithm implemented in the client decides, from the key, which server stores the data. Once the server is chosen, it is told to store "tokyo" and its value. 2. To retrieve the data, the key "tokyo" is again passed to the library. Using the same algorithm as when the data was stored, the library selects the server from the key. Because the algorithm is the same, the same server is selected, and a get command is sent. As long as the data has not been deleted for some reason, the stored value is returned.

memcached in depth: http://charlee.li/memcached-004.html

Hash distribution
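The per-key server selection described above can be sketched in a few lines of Java. This is a minimal sketch, not memcached's actual client code: the server list and the use of `hashCode()` are illustrative assumptions.

```java
import java.util.List;

// Sketch of hash-based key distribution as a memcached client does it:
// the server for a key is derived purely from the key's hash, so every
// client running the same algorithm agrees on where each key lives.
public class ModuloDistribution {
    static final List<String> SERVERS = List.of("node1", "node2", "node3");

    // Same key -> same hash -> same server, for set and get alike,
    // without any central directory.
    public static String serverFor(String key) {
        int h = Math.abs(key.hashCode() % SERVERS.size());
        return SERVERS.get(h);
    }
}
```

The weakness the next slides address: when the server count changes, the modulo changes for almost every key, so nearly all keys move.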

Page 6: Cassandra basic

Keys range from 0 to 2^32, forming a ring called the hash space ring (the hash value space). Hashing each server in the cluster (for example, by IP address) fixes its position on the ring. To locate the server for a given piece of data: compute the data key's hash h with the same function H, which determines the data's position on the ring, then walk clockwise from that position; the first server encountered is the one the data belongs to.

1. Compute the hash of each server (node) and place it on the circle of 0 to 2^32. Then compute the hash of each stored key with the same method and map it onto the circle. 2. Starting from the key's position, search clockwise and store the data on the first server found. If no server is found after passing 2^32, store it on the first server on the circle.

Consistent hashing (DHT)
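The clockwise walk over the hash space ring maps naturally onto a sorted map. A minimal sketch, assuming `hashCode()` as a stand-in for a real hash function such as Murmur3:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of a consistent-hash ring: node positions are keys in a sorted
// map, and lookup is a clockwise walk from the key's hash position.
public class HashRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    public void addNode(String node) {
        ring.put(hash(node), node);
    }

    // Walk clockwise from the key's position; wrap around to the first
    // node on the ring if we pass the end.
    public String nodeFor(String key) {
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.firstEntry().getValue()
                              : tail.get(tail.firstKey());
    }

    private static int hash(String s) {
        return s.hashCode(); // illustrative stand-in for Murmur3 etc.
    }
}
```

Adding a node only inserts one position into the map, which is why only the keys between the new node and its neighbor have to move.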

Page 7: Cassandra basic

Consistent hashing minimizes the redistribution of keys. With an ordinary hash function, however, server positions on the ring end up very unevenly distributed. The improvement: use virtual nodes, assigning each physical node 100 to 200 points on the ring. This suppresses the uneven distribution and minimizes cache redistribution when servers are added or removed.

Consistent hashing (adding a node): add one server. Under the modulo algorithm, the server storing each key changes drastically, which hurts the cache hit rate; under consistent hashing, only the keys on the first server counterclockwise from the point where the new server joins the ring are affected.

Page 8: Cassandra basic

Consistent hashing (example): keys jim, johnny, suzy, and carol placed on the ring.

Page 9: Cassandra basic

PartitionKey

Page 10: Cassandra basic
Page 11: Cassandra basic

http://docs.basho.com/riak/kv/2.1.4/learn/concepts/clusters/

Because consistent hashing skews data easily when there are too few service nodes (nodes end up unevenly placed on the ring), virtual nodes were introduced: split each of the n servers into v virtual nodes and scatter all n*v virtual nodes randomly around the consistent-hashing ring. A key then belongs to the physical node owning the first vnode reached clockwise from the key's position on the ring.

Consistent hashing (vnodes)

A ring with 32 partitions; the cluster has 4 nodes, each holding 8 vnodes.
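The vnode scheme can be sketched by hashing each physical node at several ring positions. The vnode counts and the hash function here are illustrative assumptions, not Cassandra's real token allocator:

```java
import java.util.SortedMap;
import java.util.TreeMap;

// Sketch of a ring with virtual nodes: each physical node is hashed at
// several positions so load spreads more evenly around the ring.
public class VNodeRing {
    private final TreeMap<Integer, String> ring = new TreeMap<>();

    public void addNode(String node, int vnodes) {
        // One ring position per virtual node, all mapping back to the
        // same physical node.
        for (int i = 0; i < vnodes; i++) {
            ring.put((node + "#" + i).hashCode(), node);
        }
    }

    // Clockwise walk: the first vnode at or after the key's hash owns it.
    public String nodeFor(String key) {
        SortedMap<Integer, String> tail = ring.tailMap(key.hashCode());
        return tail.isEmpty() ? ring.firstEntry().getValue()
                              : tail.get(tail.firstKey());
    }
}
```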

Page 12: Cassandra basic

A key is hashed to a position on the ring; the next vnode clockwise becomes the data's first storage node, and the following two vnodes hold the other two replicas.

vnodes & replicas

Replica 1, Replica 2, Replica 3

Page 13: Cassandra basic

http://www.littleriakbook.com/

Page 14: Cassandra basic

5 nodes and 64 partitions in total; each node holds roughly 12.8 vnodes.

Page 15: Cassandra basic

www.tom-e-white.com/2007/11/consistent-hashing.html

key -> node

Page 16: Cassandra basic

http://www.datastax.com/dev/blog/virtual-nodes-in-cassandra-1-2

One token range per node … => VNodes: many small, non-contiguous token ranges per node

Each piece of data has three replicas.

Node1

16 partitions, 3 replicas, 6 nodes: each node holds 16*3/6 = 8 token ranges

Single Token & Virtual Tokens

Token Range

Page 17: Cassandra basic

A greater number of smaller ranges rebuilds a replacement node faster than a single token per node: for the same amount of data, many small ranges stream faster than a few large ones, because more nodes can stream in parallel (copying from three nodes is slower than copying from five).

Page 18: Cassandra basic

Machines of different performance can be given different numbers of vnodes: the more capable carry more load.

VNodes (configurable virtual node counts)

Page 19: Cassandra basic

Random distribution, range transfer: shuffle each node's contiguous ranges across the ring.

http://www.datastax.com/dev/blog/upgrading-an-existing-cluster-to-vnodes-2

Page 20: Cassandra basic

Uneven data distribution

Even data distribution

http://docs.datastax.com/en/archived/cassandra/1.1/docs/cluster_architecture/partitioning.html#

Ring: the data range formed by all nodes in the cluster. Token: each node is assigned one or more tokens on the ring. Token Range: the range of values in (previous token, current token]. Walk clockwise: move clockwise to the first matching node.


Distribute tokens evenly within each data center separately, taking care that assigned tokens never overlap.

Page 21: Cassandra basic

http://engineeringblog.yelp.com/2016/06/monitoring-cassandra-at-scale.html

Data is mapped onto a ring with virtual nodes: four nodes, each with 3 virtual nodes, and 3 replicas per key. Suppose the ring has 12 token ranges in total and the three replicas are placed clockwise on consecutive ranges. For example, a key falling in token range 9 (between 8 and 9) is stored on the three nodes [A, B, C] that own tokens [9, 10, 11]. In the healthy state every token range has three replicas; when a node goes down, some token ranges lose replicas.

Token & Replicas Available

Page 22: Cassandra basic

Suppose node A goes down. Every token range belonging to A is affected, 9 token ranges in total: with vnodes and replication, each node has three virtual nodes and each key has three replicas, so 3*3 = 9 tokens are involved. For example, a key mapped to token 8 is placed on tokens [8, 9, 10] = nodes [D, A, B]; with A down, that key can no longer be replicated to A. Likewise, keys originally mapped to tokens 7, 8, 9, 11, 12, 1, 3, 4, 5 (the inner-ring tokens with only 2 available replicas) can no longer be replicated or stored on node A.

With consistency level QUORUM, every token range still has 2 of 3 replicas available, so operations proceed normally.

Counting the three nodes starting from the current one (inclusive), the number of nodes still up is the number of replicas available: 8 -> [8,9,10] -> available: [8,10] = 2; 9 -> [9,10,11] -> available: [10,11] = 2; 10 -> [10,11,12] -> available: [10,11,12] = 3.
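The availability arithmetic above can be checked with a small sketch. The 12-token A/B/C/D ownership layout mirrors the slide's figure (token 8 owned by D, 9 by A, 10 by B, 11 by C) and is otherwise an assumption:

```java
import java.util.List;
import java.util.Set;

// Sketch of the replica-availability check: a key at token t is
// replicated on the owners of tokens t, t+1, t+2 (clockwise), and
// QUORUM needs a majority (2 of 3) of those owners up.
public class ReplicaAvailability {
    // OWNERS.get(i) = node owning token i+1, matching the slide's ring.
    static final List<String> OWNERS =
        List.of("A", "B", "C", "D", "A", "B", "C", "D", "A", "B", "C", "D");

    public static int availableReplicas(int token, Set<String> downNodes) {
        int up = 0;
        for (int i = 0; i < 3; i++) {
            // Wrap around the ring past token 12.
            String owner = OWNERS.get((token - 1 + i) % OWNERS.size());
            if (!downNodes.contains(owner)) up++;
        }
        return up;
    }

    public static boolean quorumAvailable(int token, Set<String> downNodes) {
        return availableReplicas(token, downNodes) >= 2;
    }
}
```

With only A down, every token range keeps at least 2 of 3 replicas, so QUORUM still succeeds everywhere; once C is also down, token 9's range drops to a single replica and QUORUM fails there, exactly as the next slide shows.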

Page 23: Cassandra basic

Now suppose node C also goes down, losing another 6 token ranges. With consistency level QUORUM, any key falling in those 6 token ranges is unavailable: only 1 of 3 replicas remains, short of the 2-of-3 majority.

Counting the three nodes starting from the current one (inclusive): 8 -> [8,9,10] -> available: [8,10] = 2; 9 -> [9,10,11] -> available: [10] = 1; 10 -> [10,11,12] -> available: [10,12] = 2.

Page 24: Cassandra basic

Replication Strategy & Replication Factor

http://distributeddatastore.blogspot.com/2015/08/cassandra-replication.html

walking ring clockwise

Replication Strategy: how replicas are chosen (within one data center and across data centers). Replication Factor: the number of replicas (different data centers may use different replica counts).

SimpleStrategy: suitable only for a single data center and a single rack; rack-unaware. NetworkTopologyStrategy: multiple data centers; both DC-aware and rack-aware.

Data Partitioner: how the row key is hashed (Random, Murmur3, ByteOrdered); it determines how data, including replicas, is distributed across the cluster's nodes. The Replication Strategy is the replica placement policy (which nodes hold the replicas); the Partitioner determines whether data spreads evenly across the cluster and is essentially the hash algorithm that generates tokens.

Dynamic Snitch: monitors read latency and avoids routing requests to poorly performing nodes; all snitches have it enabled by default.

Page 25: Cassandra basic

NetworkTopologyStrategy

DC1 and DC2 each hold 2 replicas. Walking clockwise from key1, the first node is N2 -> DC2. DC2 = [N2, N4, N5] with racks [R1, R1, R2]. Because N2 is on RACK1, N4 (also on RACK1) cannot be chosen next; N5 on RACK2 is chosen instead. DC1 applies the same strategy.

DC2,RACK1 ✓; DC2,RACK1 ✗; DC2,RACK2 ✓; DC1,RACK1 ✓; DC1,RACK1 ✗; DC1,RACK2 ✓

> Local Read: reads do not cross data centers.
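The rack-skipping selection can be sketched as a clockwise walk that prefers racks not yet used. Node and rack names follow the slide's example; the code itself is an illustrative sketch, not Cassandra's real NetworkTopologyStrategy implementation:

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of rack-aware replica selection within one data center: walk
// the ring order and skip candidates whose rack is already represented,
// falling back to any remaining node if racks run out.
public class RackAwareSelection {
    public record Node(String name, String rack) {}

    public static List<Node> pickReplicas(List<Node> ringOrder, int rf) {
        List<Node> chosen = new ArrayList<>();
        Set<String> usedRacks = new HashSet<>();
        // First pass: prefer nodes on racks not yet represented.
        for (Node n : ringOrder) {
            if (chosen.size() == rf) break;
            if (usedRacks.add(n.rack())) chosen.add(n);
        }
        // Second pass: fill remaining slots if distinct racks ran out.
        for (Node n : ringOrder) {
            if (chosen.size() == rf) break;
            if (!chosen.contains(n)) chosen.add(n);
        }
        return chosen;
    }
}
```

With DC2's ring order [N2/R1, N4/R1, N5/R2] and RF=2 this picks N2 then N5, skipping N4 because RACK1 is already used, matching the slide.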

Page 26: Cassandra basic
Page 27: Cassandra basic
Page 28: Cassandra basic
Page 29: Cassandra basic
Page 30: Cassandra basic
Page 31: Cassandra basic
Page 32: Cassandra basic
Page 33: Cassandra basic

If no token is specified for the new node, Cassandra automatically splits the token range of the busiest node in the cluster.

The “busy” node streams half of its data to the new node in the cluster.

When the node finishes bootstrapping, it is available for client requests.

Vnodes simplify many tasks in Cassandra:
• You no longer have to calculate and assign tokens to each node.
• Rebalancing a cluster is no longer necessary when adding or removing nodes. When a node joins the cluster, it assumes responsibility for an even portion of data from the other nodes in the cluster. If a node fails, the load is spread evenly across other nodes in the cluster.
• Rebuilding a dead node is faster because it involves every other node in the cluster, and because data is sent to the replacement node incrementally instead of waiting until the end of the validation phase.
• Improves the use of heterogeneous machines in a cluster: you can assign a proportional number of vnodes to smaller and larger machines.

When joining the cluster, a new node receives data from all other nodes.

The cluster is automatically balanced after the new node finishes bootstrapping.

Adding Capacity with or without VNodes

Page 34: Cassandra basic

cluster = Cluster.builder()
    .addContactPoints("192.168.50.100", "192.168.50.101")
    .withLoadBalancingPolicy(new DCAwareRoundRobinPolicy("DC1"))
    .withRetryPolicy(DowngradingConsistencyRetryPolicy.INSTANCE)
    .build();

session = cluster.connect(keyspace);

• Each node can handle client requests, but the balancing policy is configurable.
• Round Robin: evenly distributes queries across all nodes in the cluster, regardless of datacenter.
• DC-Aware Round Robin: prefers hosts in the local datacenter and only uses nodes in remote datacenters when local hosts cannot be reached.
• Token-Aware: queries are sent first to local replicas.

Load Balancing - Driver

Retry Policy - Client Driver

A policy that defines a default behavior to adopt when a request returns an exception.

Such a policy centralizes the handling of query retries, minimizing the need for exception catching and handling in business code.

DowngradingConsistencyRetryPolicy - A retry policy that retries a query with a lower consistency level than the one initially requested.

Page 35: Cassandra basic
Page 36: Cassandra basic
Page 37: Cassandra basic
Page 38: Cassandra basic

CLIENT

local Remote

Round Robin

Page 39: Cassandra basic

CLIENT

local Remote

DC Aware Round Robin

Page 40: Cassandra basic

CLIENT

local Remote

DC Aware Round Robin

The client attempts to contact nodes in the local datacenter.

Page 41: Cassandra basic

CLIENT

local Remote

Remote nodes are used when local nodes cannot be reached.

DC Aware Round Robin

Page 42: Cassandra basic

The client sends a mutation (insert/update/delete) to a node in the cluster. That node serves as the coordinator for this transaction.

Writing Data

RF=3

Page 43: Cassandra basic

Writing Data

The coordinator forwards the update to all replicas.

RF=3

Page 44: Cassandra basic

Writing Data

The replicas acknowledge that data was written.

RF=3

Page 45: Cassandra basic

Writing Data

And the coordinator sends a successful response to the client.

RF=3

Page 46: Cassandra basic

What if a node is down?

Only two nodes respond. The client gets to choose whether the write counts as successful. Write Consistency Level = 2/QUORUM

RF=3

• ONE Returns data from the nearest replica.

• QUORUM Returns the most recent data from the majority of replicas.

• ALL Returns the most recent data from all replicas.
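The 2-of-3 majority used throughout these slides is just floor(RF/2) + 1. A tiny sketch of that arithmetic (the helper names are my own, not driver API):

```java
// Quorum as used on these slides: a strict majority of the RF replicas.
public class ConsistencyMath {
    // floor(RF/2) + 1 replicas must respond for a QUORUM read or write.
    public static int quorum(int rf) {
        return rf / 2 + 1;
    }

    // A QUORUM request succeeds iff at least quorum(rf) replicas answer.
    public static boolean quorumMet(int rf, int responded) {
        return responded >= quorum(rf);
    }
}
```

So with RF=3, two responses meet QUORUM and one does not, which is exactly the YES/NO split on the next two slides.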

Page 47: Cassandra basic

CL = QUORUM. Will this write succeed? YES! A majority of replicas received the mutation.

RF=3

What if a node is down?

Page 48: Cassandra basic

CL = QUORUM. Will this write succeed? NO. The mutation failed to reach a majority of replicas.

RF=3

What if a node is down?

Page 49: Cassandra basic

The client can still decide how to proceed

CL = QUORUM. DataStax Driver = DowngradingConsistencyRetryPolicy

Will this write succeed? YES! With consistency downgraded to ONE, the write will succeed.

RF=3

Page 50: Cassandra basic

Multi DC Writes

The coordinator forwards the mutation to local replicas and a remote coordinator.

DC1RF=3

DC2RF=3

Page 51: Cassandra basic

The remote coordinator forwards the mutation to replicas in the remote DC

Multi DC Writes

DC1RF=3

DC2RF=3

Page 52: Cassandra basic

All replicas acknowledge the write.

Multi DC Writes

DC1RF=3

DC2RF=3

Page 53: Cassandra basic

Reading Data

The client sends a query to a node in the cluster. That node serves as the coordinator.

RF=3

Page 54: Cassandra basic

Reading Data

The coordinator forwards the query to all replicas.

RF=3

Page 55: Cassandra basic

Reading Data

The replicas respond with data.

RF=3

Page 56: Cassandra basic

Reading Data

And the coordinator returns the data to the client.

RF=3

Page 57: Cassandra basic

What if the nodes disagree?

Data was written with QUORUM while one node was down. The write was successful, but that node missed the update.

RF=3

WRITE

Page 58: Cassandra basic

Now the node is back online, and it responds to a read request. It has older data than the other replicas.

RF=3

What if the nodes disagree?

READ

Page 59: Cassandra basic

The coordinator resolves the discrepancy and sends the newest data to the client.

READ REPAIR: The coordinator also notifies the “out of date” node that it has old data, and that node receives updated data from another replica.

RF=3

What if the nodes disagree?

NEWEST

Page 60: Cassandra basic

What if I’m only reading from a single node? How will Cassandra know that a node has stale data? C* will occasionally request a hash from other nodes to compare.

RF=3

Read Repair Chance

HASH
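The hash comparison described above can be sketched as a digest check: replicas exchange a hash of the value instead of the value itself, and a mismatch reveals stale data. Using a plain Java hash in place of Cassandra's real digest is an illustrative assumption:

```java
import java.util.Objects;

// Sketch of a read-repair digest check: comparing small hashes of each
// replica's value is far cheaper than shipping the full data around.
public class DigestCheck {
    static int digest(String value) {
        return Objects.hashCode(value); // stand-in for a real MD5 digest
    }

    // A digest mismatch means at least one replica holds stale data and
    // a repair is needed.
    public static boolean replicasAgree(String a, String b) {
        return digest(a) == digest(b);
    }
}
```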

Page 61: Cassandra basic

Hints provide a recovery mechanism for writes targeting offline nodes.
• The coordinator can store a hint if the target node for a write is down or fails to acknowledge.

Hinted Handoff

HINT

Page 62: Cassandra basic

The write is replayed when the target node comes online

Hinted Handoff

HINT

Page 63: Cassandra basic

If all replica nodes are down, the write can still succeed once a hint has been written.

Note that if all replica nodes are down at write time, a write at CL=ANY will not be readable until the replica nodes have recovered.

What if the hint is enough?

HINT

CL=ANY

Page 64: Cassandra basic

During a read, does the coordinator really forward the query to all replicas? That seems unnecessary!

Rapid Read Protection

RF=3

Page 65: Cassandra basic

NO. Cassandra performs only as many requests as necessary to meet the requested consistency level, routing requests to the most-responsive replicas.

Rapid Read Protection

RF=3

Page 66: Cassandra basic

If a replica doesn’t respond quickly, Cassandra will try another node. This is known as an “eager retry”.

Rapid Read Protection

RF=3