
Page 1: Storage   cassandra

Cassandra

Roc.Yang

2011.04

Page 2: Storage   cassandra

Contents

1 Overview

2 Data Model

3 Storage Model

4 System Architecture

5 Read & Write

6 Other

Page 3: Storage   cassandra

Cassandra

Overview

Page 4: Storage   cassandra

Cassandra From Facebook

Page 5: Storage   cassandra

Cassandra To

Page 6: Storage   cassandra

Cassandra – From Dynamo and Bigtable

Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. Cassandra brings together the distributed systems technologies from Dynamo and the data model from Google's BigTable. Like Dynamo, Cassandra is eventually consistent. Like BigTable, Cassandra provides a ColumnFamily-based data model richer than typical key/value systems.

Cassandra was open sourced by Facebook in 2008. It was designed by Avinash Lakshman (one of the authors of Amazon's Dynamo) and Prashant Malik (a Facebook engineer). In a lot of ways you can think of Cassandra as Dynamo 2.0, or a marriage of Dynamo and BigTable.

Page 7: Storage   cassandra

Cassandra - Overview

Cassandra is a distributed storage system for managing very large amounts of structured data spread out across many commodity servers, while providing highly available service with no single point of failure;

Cassandra does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format.

Page 8: Storage   cassandra

Cassandra - Highlights

● High availability

● Incremental scalability

● Eventually consistent

● Tunable tradeoffs between consistency and latency

● Minimal administration

● No SPOF (Single Point of Failure)

Page 9: Storage   cassandra

Cassandra – Trade Offs

● No Transactions

● No Adhoc Queries

● No Joins

● No Flexible Indexes

• Data Modeling with Cassandra Column Families: http://www.slideshare.net/gdusbabek/data-modeling-with-cassandra-column-families

Page 10: Storage   cassandra

Cassandra From Dynamo and BigTable

• Introduction to Cassandra: Replication and Consistency: http://www.slideshare.net/benjaminblack/introduction-to-cassandra-replication-and-consistency

Page 11: Storage   cassandra

Dynamo-like Features

● Symmetric, P2P Architecture: No Special Nodes/SPOFs

● Gossip-based Cluster Management

● Distributed Hash Table for Data Placement: Pluggable Partitioning, Pluggable Topology Discovery, Pluggable Placement Strategies

● Tunable, Eventual Consistency

• Data Modeling with Cassandra Column Families: http://www.slideshare.net/gdusbabek/data-modeling-with-cassandra-column-families

Page 12: Storage   cassandra

BigTable-like Features

● Sparse “Columnar” Data Model: Optional, 2-level Maps Called Super Column Families

● SSTable Disk Storage: Append-only Commit Log, Memtable (buffer and sort), Immutable SSTable Files

● Hadoop Integration

• Data Modeling with Cassandra Column Families: http://www.slideshare.net/gdusbabek/data-modeling-with-cassandra-column-families

Page 13: Storage   cassandra

Brewer's CAP Theorem

CAP (Consistency, Availability, Partition tolerance). Theorem: you can have at most two of these properties for any shared-data system, so pick two of Consistency, Availability, and Partition tolerance.

http://www.julianbrowne.com/article/viewer/brewers-cap-theorem

Page 14: Storage   cassandra

ACID & BASE

ACID (Atomicity, Consistency, Isolation, Durability). BASE (Basically Available, Soft-state, Eventually Consistent).

ACID: http://en.wikipedia.org/wiki/ACID ACID and BASE: MySQL and NoSQL:

http://www.schoonerinfotech.com/solutions/general/what_is_nosql

ACID: strong consistency; isolation; focus on “commit”; nested transactions; availability?; conservative (pessimistic); difficult evolution (e.g. schema).

BASE: weak consistency (stale data OK); availability first; best effort; approximate answers OK; aggressive (optimistic); simpler; faster; easier evolution.

Page 15: Storage   cassandra

NoSQL

The term "NoSQL" was used in 1998 as the name for a lightweight, open source relational database that did not expose a SQL interface. Its author, Carlo Strozzi, claims that because the NoSQL movement "departs from the relational model altogether, it should therefore have been called more appropriately 'NoREL', or something to that effect." Related ideas: CAP, BASE, eventual consistency.

NoSQL: http://en.wikipedia.org/wiki/NoSQL

http://nosql-database.org/

Page 16: Storage   cassandra

Dynamo & Bigtable

Dynamo-style partitioning and replication; a log-structured ColumnFamily data model similar to Bigtable's.

● Bigtable: A distributed storage system for structured data, 2006

● Dynamo: Amazon's highly available key-value store, 2007

Page 17: Storage   cassandra

Dynamo & Bigtable

● BigTable: strong consistency; sparse map data model; GFS, Chubby, etc.

● Dynamo: O(1) distributed hash table (DHT); BASE (eventual consistency); client-tunable consistency/availability

Page 18: Storage   cassandra

Dynamo & Bigtable

● CP: Bigtable, Hypertable, HBase

● AP: Dynamo, Voldemort, Cassandra

Page 19: Storage   cassandra

Cassandra

Dynamo Overview

Page 20: Storage   cassandra

Dynamo Architecture & Lookup

● O(1) node lookup

● Explicit replication

● Eventually consistent

Page 21: Storage   cassandra

Dynamo

Dynamo:

a highly available key-value storage system that some of Amazon's core services use to provide an "always-on" experience;

also the name of an unrelated software dynamic optimization system capable of transparently improving the performance of a native instruction stream as it executes on the processor.

Page 22: Storage   cassandra

Service-Oriented Architecture

Page 23: Storage   cassandra

Dynamo Techniques

Problem → Technique adopted

● Balanced data distribution → improved consistent hashing; data replication

● Handling conflicting data → vector clocks

● Handling temporary failures → hinted handoff (deferred write hand-back), weak quorum with tunable parameters (W, R, N)

● Recovery from permanent failures → Merkle hash trees

● Membership and failure detection → gossip-based membership protocol and failure detection

Key techniques of the Dynamo architecture

Page 24: Storage   cassandra

Dynamo Techniques Advantages

Summary of techniques used in Dynamo and their advantages

Page 25: Storage   cassandra

Dynamo: balanced data distribution

Advantages of consistent hashing: load balancing, and masking differences in the processing capacity of individual nodes (via virtual nodes).

[Ring diagram: virtual nodes A–D and physical nodes A–G are placed on the ring by hashing each node; a key k is placed by hashing the data's key.]

Page 26: Storage   cassandra

Dynamo: handling conflicting data

Eventual consistency model; vector clocks (Vector Clock).

Page 27: Storage   cassandra

Dynamo: handling temporary failures

Read/write parameters W, R, N:

N: the number of replicas of each record in the system.
W: the number of replicas that must acknowledge a write for it to be considered successful.
R: the minimum number of replicas that must be read for each read request.

As long as R + W > N holds, users can configure R and W themselves. Advantage: a tunable balance between availability and fault tolerance.
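
A minimal sketch of the R + W > N rule (illustrative Python, not Dynamo or Cassandra code): when R + W > N, any read set of R replicas must overlap any successful write set of W replicas, so a read is guaranteed to see at least one copy of the latest acknowledged write.

```python
def quorum_overlap(n: int, w: int, r: int) -> bool:
    """True if every W-replica write set must intersect every R-replica read set."""
    return r + w > n

# Example with N = 3 replicas.
print(quorum_overlap(3, 2, 2))  # True  -> read and write sets always share a replica
print(quorum_overlap(3, 1, 1))  # False -> a read may miss the latest write
```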

Page 28: Storage   cassandra

Dynamo: recovery from permanent failures

Merkle hash tree technique: in Dynamo, the leaf nodes of the Merkle tree are the hashes of the stored data, and each parent node is the hash of all of its children.

[Figure: two Merkle trees, A and B, compared node by node to locate divergent key ranges.]
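
A toy sketch of that idea under simplifying assumptions (plain Python lists, MD5 hashing; not Dynamo's or Cassandra's actual Merkle-tree code): leaves hash the stored data and each parent hashes its children, so two replicas can compare roots cheaply and descend only into subtrees that differ.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.md5(data).digest()

def merkle_levels(values):
    """Build a Merkle tree bottom-up; returns a list of levels, root level last."""
    level = [h(v) for v in values]               # leaves: hashes of the stored data
    levels = [level]
    while len(level) > 1:
        pairs = [level[i:i + 2] for i in range(0, len(level), 2)]
        level = [h(b"".join(p)) for p in pairs]  # parent = hash of its children
        levels.append(level)
    return levels

replica_a = [b"v0", b"v1", b"v2", b"v3"]
replica_b = [b"v0", b"v1", b"vX", b"v3"]         # one divergent value

root_a = merkle_levels(replica_a)[-1][0]
root_b = merkle_levels(replica_b)[-1][0]
print(root_a == root_b)   # False -> replicas disagree; descend to find the bad range
```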

Page 29: Storage   cassandra

Dynamo: membership and failure detection

Gossip-protocol-based membership detection.

[Figure: a seed node plus nodes A, B and C; new nodes 1 and 2 join by gossiping with the seed.]

Page 30: Storage   cassandra

Consistent Hashing - Dynamo

Dynamo splits each server into v virtual nodes and then assigns all n*v virtual nodes to random positions on the consistent-hashing ring. Each key is served by the first vnode reached by walking clockwise from the key's position on the ring; if that node has failed, the next vnode clockwise is used as a replacement.

When a single node fails, its load is spread evenly across all the other nodes, and the implementation is reasonably elegant.
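
A small illustrative sketch of consistent hashing with virtual nodes, roughly as described above (hypothetical class and node names, not Dynamo code): each server contributes several vnodes to the ring, a key is owned by the first vnode clockwise from its hash, and a failed server is skipped in favor of the next vnode.

```python
import bisect
import hashlib

def ring_hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class ConsistentHashRing:
    def __init__(self, servers, vnodes_per_server=4):
        # Place n * v virtual nodes on the ring, each owned by a physical server.
        self.ring = sorted(
            (ring_hash(f"{s}#{i}"), s) for s in servers for i in range(vnodes_per_server)
        )
        self.positions = [pos for pos, _ in self.ring]

    def owner(self, key: str, down=()):
        """Walk clockwise from the key's position to the first live virtual node."""
        start = bisect.bisect(self.positions, ring_hash(key)) % len(self.ring)
        for step in range(len(self.ring)):
            _, server = self.ring[(start + step) % len(self.ring)]
            if server not in down:
                return server

ring = ConsistentHashRing(["node-A", "node-B", "node-C"])
print(ring.owner("user:42"))                    # normal owner
print(ring.owner("user:42", down={"node-A"}))   # failover: next vnode clockwise
```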

Page 31: Storage   cassandra

Consistent Hashing - Dynamo

Page 32: Storage   cassandra

Cassandra

Bigtable Overview

Page 33: Storage   cassandra

Bigtable

Replica

Replica

Replica

Replica

Master

GFS(Google File System)

Bigtable

TabletServer

TabletServer

TabletServer

Chubby

Client

Cluster Management System

Page 34: Storage   cassandra

Bigtable

Tablet

In Bigtable, a table is split into slices called tablets; each tablet is kept at roughly 100–200 MB.

Column Families: ① the basic unit of access control; ② all data stored in a column family is usually of the same type (data in the same column family is compressed together).

Timestamp

Each cell in a Bigtable can contain multiple versions of the same data; these versions are indexed by timestamp.

Treats data as uninterpreted strings

Page 35: Storage   cassandra

Bigtable: Data Model

● <Row, Column, Timestamp> triple for key; lookup, insert, and delete API

● Arbitrary “columns” on a row-by-row basis; column family:qualifier — the family is heavyweight, the qualifier lightweight; column-oriented physical store — rows are sparse!

● Does not support a relational model: no table-wide integrity constraints, no multi-row transactions

Page 36: Storage   cassandra

Bigtable: Tablet location hierarchy

Bigtable uses a three-level hierarchy, analogous to that of a B+ tree, to store tablet location information.

Page 37: Storage   cassandra

Bigtable: METADATA

The first level is a file stored in Chubby that contains the location of the root tablet

The root tablet contains the location of all tablets in a special METADATA table

The METADATA table stores the location of a tablet under a row key that is an encoding of the tablet's table identifier and its end row

Each METADATA row stores approximately 1KB of data in memory

METADATA table also stores secondary information, including a log of all events pertaining to each tablet (such as when a server begins serving it). This information is helpful for debugging and performance analysis

Page 38: Storage   cassandra

Bigtable: Tablet Representation

Page 39: Storage   cassandra

Bigtable: SSTable

Page 40: Storage   cassandra

Cassandra

Data Model

Page 41: Storage   cassandra

Cassandra – Data Model

A table in Cassandra is a distributed multi-dimensional map indexed by a key. The value is an object which is highly structured.

Every operation under a single row key is atomic per replica, no matter how many columns are being read or written.

Columns are grouped together into sets called column families (very much like what happens in the Bigtable system). Cassandra exposes two kinds of column families: Simple and Super column families.

Super column families can be visualized as a column family within a column family.

Page 42: Storage   cassandra

Cassandra – Data Model

Column Families are declared upfront. Columns are added and modified dynamically. SuperColumns are added and modified dynamically.

[Figure: a row identified by KEY holds three column families — ColumnFamily1 "MailList" (Type: Simple, Sort: Name) with columns tid1–tid4, each a (Name, Value <Binary>, TimeStamp) triple; ColumnFamily2 "WordList" (Type: Super, Sort: Time) with super columns "aloha" and "dude", each containing columns (C1, V1, T1), (C2, V2, T2), …; ColumnFamily3 "System" (Type: Super, Sort: Name) with super columns hint1–hint4, each holding a column list.]

Page 43: Storage   cassandra

Cassandra – Data Model

Keyspace — uppermost namespace; typically one per application; roughly equivalent to a database.

ColumnFamily — associates records of a similar kind (not the same kind, because CFs are sparse tables); record-level atomicity; indexed.

Row — each row is uniquely identifiable by its key; rows group columns and super columns.

Column — the basic unit of storage.

Page 44: Storage   cassandra

Cassandra – Data Model

Page 45: Storage   cassandra

Cassandra – Data Model (an example)

Page 46: Storage   cassandra

Cassandra – Data Model

http://www.divconq.com/2010/cassandra-columns-and-supercolumns-and-rows/

Page 47: Storage   cassandra

Cassandra – Data Model

http://www.divconq.com/2010/cassandra-columns-and-supercolumns-and-rows/

Page 48: Storage   cassandra

Cassandra – Data Model - Cluster

Cluster

Page 49: Storage   cassandra

Cassandra – Data Model - Cluster

Cluster > Keyspace

Partitioners: OrderPreservingPartitioner, RandomPartitioner

Like an RDBMS schema: one Keyspace per application

Page 50: Storage   cassandra

Cassandra – Data Model

Cluster > Keyspace > Column Family

Like an RDBMS table: separates types in an app

Page 51: Storage   cassandra

Cassandra – Data Model

SortedMap<Name,Value>...

Cluster > Keyspace > Column Family > Row

Page 52: Storage   cassandra

Cassandra – Data Model

Cluster > Keyspace > Column Family > Row > “Column”

… Name → Value, i.e. byte[] → byte[], plus a version timestamp

Not like an RDBMS column: it is an attribute of the row — each row can contain millions of different columns

Page 53: Storage   cassandra

Cassandra – Data Model

Any column within a column family is accessed using the convention:

column family : column

Any column within a column family that is of type super is accessed using the convention:

column family : super column : column
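
A plain-Python illustration of these addressing conventions, modeling a keyspace as nested maps (a toy structure for illustration, not a Cassandra client API):

```python
# keyspace -> column family -> row key -> column name -> (value, timestamp)
keyspace = {
    "MailList": {                      # simple column family
        "user1": {"tid1": ("v1", 1), "tid2": ("v2", 2)},
    },
    "WordList": {                      # super column family
        "user1": {"aloha": {"c1": ("v1", 1), "c2": ("v2", 2)}},
    },
}

def get(cf, row, column, super_column=None):
    """column family : column, or column family : super column : column."""
    cols = keyspace[cf][row]
    return cols[super_column][column] if super_column else cols[column]

print(get("MailList", "user1", "tid2"))           # ('v2', 2)
print(get("WordList", "user1", "c1", "aloha"))    # ('v1', 1)
```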

Page 54: Storage   cassandra

Cassandra

Storage Model

Page 55: Storage   cassandra

Storage Model

[Write path diagram: a write for a key touching CF1, CF2, CF3 is first binary-serialized to the commit log (kept on a dedicated disk), then applied to a separate Memtable per column family. Memtables are flushed based on data size, number of objects, and lifetime. The resulting data file on disk holds, per key: <key name><size of key data><index of columns/supercolumns><serialized column family>, plus a block index (<key name> → offset, e.g. K128 offset, K256 offset, K384 offset) and a Bloom filter kept in memory.]

Page 56: Storage   cassandra

Storage Model-Compactions

[Compaction diagram: several sorted data files — e.g. (K1, K2, K3, …), (K2, K10, K30, …), (K4, K5, K10, …), each holding <key, serialized data> entries — are merge-sorted into a single sorted data file (K1, K2, K3, K4, K5, K10, K30). The merge produces a new index file (K1 offset, K5 offset, K30 offset) and a Bloom filter loaded in memory; deleted entries are dropped and the old files are removed.]
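
An illustrative sketch of the merge step shown above (toy tuples, not Cassandra's compaction code): several key-sorted SSTables are merge-sorted into one, keeping only the newest timestamped version of each key.

```python
import heapq

# Each "SSTable" is a list of (key, timestamp, value) rows sorted by key.
sstable_1 = [("k1", 1, "a"), ("k2", 1, "b"), ("k3", 1, "c")]
sstable_2 = [("k10", 2, "x"), ("k2", 5, "b2"), ("k30", 2, "y")]
sstable_3 = [("k10", 7, "x2"), ("k4", 3, "d"), ("k5", 3, "e")]

def compact(*sstables):
    """Merge-sort the inputs and keep only the newest version of each key."""
    merged = {}
    for key, ts, value in heapq.merge(*[sorted(t) for t in sstables]):
        if key not in merged or ts > merged[key][0]:
            merged[key] = (ts, value)
    return sorted(merged.items())      # the new, single sorted "SSTable"

for key, (ts, value) in compact(sstable_1, sstable_2, sstable_3):
    print(key, ts, value)
```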

Page 57: Storage   cassandra

Storage Model - Write

A client sends a write request to any (randomly chosen) node in the Cassandra cluster.

The partitioner decides which node is responsible for the data: RandomPartitioner (distributes purely by hash) or OrderPreservingPartitioner (keeps data sorted in its original key order).

The owning node first writes a log record locally, then applies the write to its in-memory copy (the MemTable).

The commit log (Commit Log) is kept on a dedicated local disk on each machine.

Page 58: Storage   cassandra

Storage Model - Write

No locks anywhere on the critical path; sequential disk access; behaves like a write-back cache; append-only operations with no extra read overhead; atomicity is guaranteed only per ColumnFamily; always writable (using Hinted Handoff), even when nodes have failed.

Page 59: Storage   cassandra

Storage Model - Read

A read request can be issued to any node; the partitioner routes it to the responsible nodes; the coordinator waits for R responses, then waits for the remaining N − R responses in the background and performs Read Repair.

• Reads touch multiple SSTables, so reads are slower than writes (though still fast).
• Bloom filters reduce the number of SSTables that must be examined.
• Key/column indexes improve the efficiency of locating a key and a column within an SSTable.
• Adding more memory reduces lookup time and the number of lookups.
• Scales to millions of records.

Page 60: Storage   cassandra

Cassandra – Storage

Cassandra's storage machinery borrows from Bigtable's design and uses Memtables and SSTables. Like a relational database, Cassandra writes a log record before writing data; this log is called the commitlog. The data is then written into the Memtable of the corresponding Column Family, and the contents of a Memtable are kept sorted by key. A Memtable is an in-memory structure; once certain conditions are met it is flushed to disk in one batch and stored as an SSTable. This mechanism works like a write-back cache: it turns random I/O writes into sequential I/O writes, greatly reducing the pressure that heavy write loads put on the storage system. Once an SSTable has been written it is immutable and can only be read; the next Memtable flush goes into a new SSTable file. So for Cassandra there are effectively only sequential writes, never random writes. SSTable: http://wiki.apache.org/cassandra/ArchitectureSSTable
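
A toy sketch of this write-back flow (hypothetical class; in-memory lists stand in for disk files): every write is appended to the commit log first, applied to the memtable, and the memtable is flushed as a whole into an immutable, sorted "SSTable".

```python
class ToyStore:
    """Append to a commit log, buffer sorted data in a memtable, flush to immutable SSTables."""

    def __init__(self, memtable_limit=3):
        self.commit_log = []          # append-only, sequential "disk" writes
        self.memtable = {}            # in-memory buffer, flushed as a whole
        self.sstables = []            # immutable, sorted files once written
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.commit_log.append((key, value))   # 1. log first, for durability
        self.memtable[key] = value             # 2. then apply to the memtable
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Memtable contents are written out sorted by key and never modified again.
        self.sstables.append(sorted(self.memtable.items()))
        self.memtable = {}

store = ToyStore()
for i in range(4):
    store.write(f"k{i}", f"v{i}")
print(store.sstables)   # one flushed, sorted SSTable
print(store.memtable)   # remaining buffered write
```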

Page 61: Storage   cassandra

Cassandra – Storage

Because SSTable data cannot be updated in place, the data for one Column Family may end up stored across several SSTables. A query then has to merge reads from all of that Column Family's SSTables and its Memtable, so once the number of SSTables for a Column Family grows large, query efficiency can degrade badly. A mechanism is therefore needed to quickly determine which SSTables a queried key may fall in, without reading and merging all of them. Cassandra uses a Bloom filter: several hash functions map each key into a bit array, which allows a fast test of whether a key may be present in a given SSTable.

Page 62: Storage   cassandra

Cassandra – Storage

To limit the performance impact of having many SSTables, Cassandra also periodically merges several SSTables into one new SSTable. Because the keys within each SSTable are already sorted, a single merge sort is enough to do this, so the cost is acceptable. In Cassandra's data storage directory you can therefore see three kinds of files, named roughly: ColumnFamilyName-<sequence number>-Data.db, ColumnFamilyName-<sequence number>-Filter.db, ColumnFamilyName-<sequence number>-index.db.

The Data.db file is the SSTable data file (SSTable is short for Sorted Strings Table: key/value strings stored sorted by key). index.db is the index file, recording each key's offset within the data file, and Filter.db is the mapping file generated by the Bloom filter algorithm.

Page 63: Storage   cassandra

Cassandra

System Architecture

Page 64: Storage   cassandra

System Architecture Content

Overview; Partitioning; Replication; Membership & Failure Detection; Bootstrapping; Scaling the Cluster; Local Persistence; Communication

Page 65: Storage   cassandra

System Architecture

Core layer: messaging service, gossip, failure detection, cluster state, partitioner, replication.

Middle layer: commit log, memtable, SSTable, indexes, compaction.

Top layer: tombstones, hinted handoff, read repair, bootstrap, monitoring, admin tools.

Page 66: Storage   cassandra

System Architecture

Core Layer Middle Layer Top Layer Above the top layer

Page 67: Storage   cassandra

System Architecture

Core Layer:

§   Messaging Service (async, non-blocking)

§   Gossip Failure detector

§   Cluster membership/state

§   Partitioner(Partitioning scheme)

§   Replication strategy

Page 68: Storage   cassandra

System Architecture

Middle Layer

§   Commit log

§   Memory-table

§   Compactions

§   Hinted handoff

§   Read repair

§   Bootstrap

Page 69: Storage   cassandra

System Architecture

Top Layer

§   Key, block, & column indexes

§   Read consistency

§   Touch cache

§   Cassandra API

§   Admin API


Page 70: Storage   cassandra

System Architecture

Above the top layer:

§   Tools

§   Hadoop integration

§   Search API and Routing

Page 71: Storage   cassandra

System Architecture

[Layer diagram: Messaging Layer; Cluster Membership and Failure Detector; Storage Layer with Partitioner and Replicator; Cassandra API and Tools on top.]

Page 72: Storage   cassandra

Cassandra - Architecture

Page 73: Storage   cassandra

System Architecture

The architecture of a storage system needs to have the following characteristics:

scalable and robust solutions for: load balancing; membership and failure detection; failure recovery; replica synchronization; overload handling; state transfer; concurrency and job scheduling; request marshalling; request routing; system monitoring and alarming; configuration management

Page 74: Storage   cassandra

System Architecture

we will focus on the core distributed systems techniques used in Cassandra:

partitioning, replication, membership, failure handling, and scaling. All these modules work in synchrony to handle read/write requests.

Page 75: Storage   cassandra

System Architecture - Partitioning

One of the key design features of Cassandra is the ability to scale incrementally. This requires the ability to dynamically partition the data over the set of nodes in the cluster. Cassandra partitions data across the cluster using consistent hashing, but uses an order-preserving hash function to do so.

Cassandra uses consistent hashing: all nodes are placed on a ring according to their hash values, and the position of a node on the ring is randomly determined. Each node is responsible for a replicated range of the hash function's output space.

Page 76: Storage   cassandra

System Architecture – Partitioning (Ring Topology)

[Conceptual ring with nodes a, d, g, j and RF=3: one token per node, multiple ranges per node.]

Page 77: Storage   cassandra

[Conceptual ring with nodes a, d, g, j and RF=2: one token per node, multiple ranges per node.]

System Architecture – Partitioning (Ring Topology)

Page 78: Storage   cassandra

Token assignment; range adjustment; bootstrap. A node's arrival only affects its immediate neighbors.

[Ring diagram: new node m joins the ring of a, d, g, j with RF=3.]

System Architecture – Partitioning (New Node)

Page 79: Storage   cassandra

A node dies. Is it available? Hinting / handoff. Achtung! Plan for this.

[Ring diagram: nodes a, d, g, j with RF=3, one node unavailable.]

System Architecture – Partitioning (Ring Partition)

Page 80: Storage   cassandra

System Architecture – Partitioning

In a real Cassandra deployment, one key issue that must be considered is the choice of Tokens. The Token determines the range of data each node stores: each node holds the keys in the half-open interval (previous node's Token, this node's Token], and all nodes form a ring joined end to end, so the first node stores the data greater than the largest Token and less than or equal to the smallest Token.

The Token type and how it should be chosen depend on the partitioning strategy. Cassandra (version 0.6) supports three partitioning strategies:

RandomPartitioner

OrderPreservingPartitioner

CollatingOrderPreservingPartitioner

Page 81: Storage   cassandra

System Architecture – Partitioning

RandomPartitioner: random partitioning is a hash partitioning strategy. Tokens are big integers (BigInteger) in the range [0 ~ 2^127], so in the extreme case a cluster using random partitioning could have (2^127 + 1) nodes.

Cassandra uses MD5 as the hash function, producing a 128-bit integer (one bit is the sign bit; the Token is the absolute value of the result). A cluster using random partitioning cannot support range queries on keys. If the cluster has N nodes and each node's hash space is to be distributed evenly, the Token of the i-th node can be set to:

i * ( 2 ^ 127 / N )
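
A small sketch of the token arithmetic described here (illustrative only; the sign-bit handling is approximated with a modulo): an MD5-based token for a key, and the evenly spaced initial tokens i * 2^127 / N.

```python
import hashlib

RING_MAX = 2 ** 127

def md5_token(key: str) -> int:
    # RandomPartitioner-style token: the 128-bit MD5 folded into [0, 2^127).
    digest = int(hashlib.md5(key.encode()).hexdigest(), 16)
    return digest % RING_MAX

def initial_tokens(n: int):
    """Evenly spaced initial tokens for an N-node cluster: i * (2^127 / N)."""
    return [i * RING_MAX // n for i in range(n)]

print(initial_tokens(4))
print(md5_token("user:42"))
```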

Page 82: Storage   cassandra

System Architecture – Partitioning

OrderPreservingPartitioner: choose this ordered partitioning strategy if you need range queries on keys. It uses string Tokens, and the concrete Token for each node has to be chosen based on the actual keys. If no InitialToken is specified, the system uses a random 16-character string containing upper- and lower-case letters and digits.

CollatingOrderPreservingPartitioner: also an ordered partitioning strategy like OrderPreservingPartitioner, but with a different sort order. It uses byte Tokens and supports locale-specific collation; the code defaults to en_US.

Page 83: Storage   cassandra

System Architecture – Partitioning

Random: the system uses MD5(key) to distribute data across nodes; gives an even distribution of keys from one CF across ranges/nodes.

Order Preserving: key distribution is determined by the lexicographical ordering of tokens; you can specify the token for each node, giving a 'scrabble' distribution; required for range queries — scan over rows like a cursor in an index.

Page 84: Storage   cassandra

System Architecture – Partitioning - Token

A Token is a partitioner-dependent element on the Ring. Each Node has a single, unique Token. Each Node claims a Range of the Ring from its Token to the Token of the previous Node on the Ring.

Page 85: Storage   cassandra

System Architecture – Partitioning

Map from key space to token.

RandomPartitioner: tokens are integers in the range [0 .. 2^127]; MD5(Key) → Token. Good: even key distribution. Bad: inefficient range queries.

OrderPreservingPartitioner: tokens are UTF-8 strings in the range [“” .. ); Key → Token. Good: efficient range queries. Bad: uneven key distribution.

Page 86: Storage   cassandra

System Architecture – Snitching

Map from nodes to physical location.

EndpointSnitch: guess at rack and data center based on IP address octets.

DataCenterEndpointSnitch: specify IP subnets for racks, grouped per data center.

PropertySnitch: specify arbitrary mappings from individual IP addresses to racks and data centers.

Page 87: Storage   cassandra

System Architecture - Replication

Cassandra uses replication to achieve high availability and durability.

Each data item is replicated at N hosts, where N is the replication factor configured “per-instance”.

Each key, k, is assigned to a coordinator node. The coordinator is in charge of the replication of the data items that fall within its range. In addition to locally storing each key within its range, the coordinator replicates these keys at N−1 other nodes in the ring.

Page 88: Storage   cassandra

System Architecture – Placement

Map from token space to nodes. The first replica is always placed on the node that claims the range in which the token falls; strategies determine where the rest of the replicas are placed.

Cassandra provides the client with various options for how data needs to be replicated, via replication policies such as:

Rack Unaware, Rack Aware (within a datacenter), Datacenter Aware

Page 89: Storage   cassandra

System Architecture - Replication

Rack Unaware

Place replicas on the N-1 subsequent nodes around the ring, ignoring topology.

If an application chooses the “Rack Unaware” replication strategy, then the non-coordinator replicas are chosen by picking the N−1 successors of the coordinator on the ring.
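
A minimal sketch of Rack Unaware placement as described above (hypothetical ring dictionary, not Cassandra's strategy code): the coordinator is the node claiming the range the token falls in, and the N−1 successors on the ring hold the remaining replicas.

```python
import bisect

def replicas_rack_unaware(ring, key_token, n):
    """Coordinator = first node whose token >= the key's token (wrapping),
    plus the N-1 subsequent nodes around the ring, ignoring topology."""
    tokens = sorted(ring)                       # ring = {token: node}
    start = bisect.bisect_left(tokens, key_token) % len(tokens)
    return [ring[tokens[(start + i) % len(tokens)]] for i in range(n)]

ring = {0: "a", 25: "d", 50: "g", 75: "j"}
print(replicas_rack_unaware(ring, key_token=30, n=3))   # ['g', 'j', 'a']
```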

Page 90: Storage   cassandra

System Architecture - Replication

Rack Aware (within a datacenter)

Place the second replica in another datacenter, and the remaining N-2 replicas on nodes in other racks in the same datacenter.

Page 91: Storage   cassandra

System Architecture - Replication

Datacenter Aware

Place M of the N replicas in another datacenter, and the remaining N-M-1 replicas on nodes in other racks in the same datacenter.

Page 92: Storage   cassandra

System Architecture – Partitioning

Page 93: Storage   cassandra

System Architecture - Replication

1) Every node is aware of every other node in the system, and hence of the ranges they are responsible for. This is learned through gossiping (not from the leader).

2) A key is assigned to a node; that node is the key's coordinator, responsible for replicating the item associated with the key on N−1 replicas in addition to itself.

3) Cassandra offers several replication policies and leaves it to the application to choose one. These policies differ in where the selected replicas are located: Rack Aware, Rack Unaware and Datacenter Aware are some of them.

4) Whenever a new node joins the system, it contacts the leader of the Cassandra cluster, who tells the node which range it is responsible for replicating keys for.

5) Cassandra uses Zookeeper for maintaining the leader.

6) The nodes responsible for the same range are called the “preference list” for that range; this terminology is borrowed from Dynamo.

Page 94: Storage   cassandra

System Architecture – Replication

Page 95: Storage   cassandra

System Architecture - Replication

Replication factor: how many nodes data is replicated on.

Consistency level: Zero, One, Quorum, All; sync or async for writes; reliability of reads; read repair.

Page 96: Storage   cassandra

System Architecture – Replication(Leader)

Cassandra elects a leader amongst its nodes using a system called Zookeeper.

All nodes, on joining the cluster, contact the leader, who tells them which ranges they are replicas for; the leader makes a concerted effort to maintain the invariant that no node is responsible for more than N−1 ranges in the ring.

The metadata about the ranges a node is responsible for is cached locally at each node and, in a fault-tolerant manner, inside Zookeeper; this way a node that crashes and comes back up knows which ranges it was responsible for. Borrowing from Dynamo parlance, the nodes responsible for a given range are called the “preference list” for the range.

Page 97: Storage   cassandra

System Architecture - Membership

Cluster membership in Cassandra is based on Scuttlebutt, a very efficient anti-entropy gossip-based mechanism.

Page 98: Storage   cassandra

System Architecture - Failure handling

Failure detection is a mechanism by which a node can locally determine if any other node in the system is up or down. In Cassandra failure detection is also used to avoid attempts to communicate with unreachable nodes during various operations.

Cassandra uses a modified version of the Accrual Failure Detector.

Page 99: Storage   cassandra

System Architecture - Bootstrapping

When a node starts for the first time, it chooses a random token for its position in the ring:

In Cassandra, joins and leaves of nodes are initiated by an explicit mechanism rather than an automatic one. A node may be ordered to leave the network because of some malfunction observed in it, in which case it should be back soon; if a node leaves the network forever, the data must be re-partitioned, and the same is true when a new node joins. A new node is frequently added precisely because some existing nodes can no longer handle their load, so the new node is assigned part of the range that some heavily loaded node is currently responsible for. In that case data must be transferred between the two replicas, the old one and the new one. This is usually done after the administrator issues a join, and it need not shut the system down for the fraction of the range being transferred, since other replicas hopefully hold the same data. Once the data has been transferred to the new node, the older node no longer holds it.

Page 100: Storage   cassandra

System Architecture - Scaling

When a new node is added into the system, it gets assigned a token such that it can alleviate a heavily loaded node

Page 101: Storage   cassandra

System Architecture - Scaling

Page 102: Storage   cassandra

System Architecture - Local Persistence

The Cassandra system relies on the local file system for data persistence.

Page 103: Storage   cassandra

System Architecture - Communication

Control messages use UDP; application-related messages, such as read/write requests and replication requests, use TCP.

Page 104: Storage   cassandra

Cassandra

Read & Write

Page 105: Storage   cassandra

Cassandra – Read/Write

Tunable consistency, per read/write:

• One — return once one replica responds with success
• Quorum — return once RF/2 + 1 replicas respond
• All — return when all replicas respond

Want async replication? Write = ONE, Read = ONE (performance++).
Want strong consistency? Read = QUORUM, Write = QUORUM.
Want strong consistency per data center? Read = LOCAL_QUORUM, Write = LOCAL_QUORUM.
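
An illustrative helper for the consistency levels listed above (an assumed mapping for the sketch, not the Cassandra API): it computes how many replica acknowledgements each level waits for and shows that QUORUM reads overlap QUORUM writes when W + R > RF.

```python
def required_responses(level: str, rf: int) -> int:
    """How many replica responses a coordinator waits for, per consistency level."""
    return {"ONE": 1, "QUORUM": rf // 2 + 1, "ALL": rf}[level]

RF = 3
w = required_responses("QUORUM", RF)
r = required_responses("QUORUM", RF)
print(w, r, w + r > RF)   # 2 2 True -> QUORUM reads see QUORUM writes
```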

Page 106: Storage   cassandra

Cassandra – Read/Write

When a read or write request reaches any node in the cluster, the state machine moves through the following states:

① The nodes that replicate the data for the key are identified.

② The request is forwarded to all those nodes and the coordinator waits for the responses to arrive.

③ If the replies do not arrive within a configured timeout value, the request fails and an error is returned to the client.

④ If replies are received, the latest response is determined based on timestamps.

⑤ Replicas holding old data are updated (a repair of the data is scheduled at any replica that does not have the latest piece of data).

Page 107: Storage   cassandra

Cassandra - Read Repair

Every read reads all replicas, but only one replica's data is returned; a checksum or timestamp check is applied across all replicas.

If an inconsistency is found, all the data is fetched and merged, and the latest data is written back to the out-of-sync nodes.

Page 108: Storage   cassandra

Cassandra - Reads

Practically lock free; SSTable proliferation. New in 0.6:

Row cache (avoid sstable lookup, not write-through)

Key cache (avoid index scan)

Page 109: Storage   cassandra

Cassandra - Read

Any node; read repair; the usual caching conventions apply.

Page 110: Storage   cassandra

Read

[Read path diagram: the client sends a query to any node in the Cassandra cluster, which forwards the full read to the closest replica (Replica A) and digest queries to the other replicas (B and C). The coordinator returns the result to the client and performs read repair if the digest responses differ.]

Page 111: Storage   cassandra

Cassandra - Write

No reads; no seeks; sequential disk access; atomic within a column family; fast; any node; always writeable.

Page 112: Storage   cassandra

Cassandra – Write(Properties)

No locks in the critical path; sequential disk access; behaves like a write-back cache; append support without read-ahead; atomicity guarantee for a key; “always writable” (accepts writes during failure scenarios).

Page 113: Storage   cassandra

Cassandra - Writes

Commit log for durability: configurable fsync, sequential writes only.

Memtable: no disk access (no reads or seeks).

SSTables are final (become read-only): indexes, Bloom filter, raw data.

Bottom line: FAST.

Page 114: Storage   cassandra

Cassandra - Write

Page 115: Storage   cassandra

Cassandra - Write

The system can be configured to perform either synchronous or asynchronous writes.

For certain systems that require high throughput we rely on asynchronous replication; here the writes far exceed the reads that come into the system.

In the synchronous case, we wait for a quorum of responses before we return a result to the client.

Page 116: Storage   cassandra

Cassandra - Write

Page 117: Storage   cassandra

Cassandra – Write(Fast)

Fast writes: staged event-driven architecture (SEDA) — a general-purpose framework for high concurrency and load conditioning. It decomposes applications into stages separated by queues and adopts a structured approach to event-driven concurrency.

Page 118: Storage   cassandra

Cassandra – Write cont’d

Page 119: Storage   cassandra

Cassandra – Write(Compactions)

Page 120: Storage   cassandra

Cassandra – Gossip

A Cassandra cluster is a collection of peer nodes — there is no “master” node and no single point of failure — so every node must actively keep track of the state of the other nodes in the cluster. They do this with a mechanism called gossip: every second, each node “gossips” the state it knows for the nodes in the cluster to 1–3 other nodes. The gossip data is versioned, so any change on a node propagates quickly through the whole cluster. In this way every node learns the current state of every other node: whether it is bootstrapping, running normally, and so on.
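
A toy simulation of versioned gossip as described here (illustrative data structures, not Cassandra's Gossiper): each node pushes its view to a couple of random peers per round and newer versions win, so a state change spreads to every node within a few rounds.

```python
import random

# Each node keeps a versioned view of node states; a higher version number wins.
nodes = {name: {name: (1, "normal")} for name in ["A", "B", "C", "D"]}

def gossip_round():
    for name, view in nodes.items():
        # Gossip to a couple of random peers (the slides say 1-3 per second).
        for peer in random.sample([n for n in nodes if n != name], 2):
            peer_view = nodes[peer]
            for node, (version, state) in view.items():
                if version > peer_view.get(node, (0, ""))[0]:
                    peer_view[node] = (version, state)   # newer info propagates

nodes["A"]["A"] = (2, "bootstrapping")   # A changes state and bumps its version
rounds = 0
while any(view.get("A") != (2, "bootstrapping") for view in nodes.values()):
    gossip_round()
    rounds += 1
print(rounds, nodes["D"]["A"])   # the whole cluster converges within a few rounds
```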

Page 121: Storage   cassandra

Cassandra – Hinted Handoff

Cassandra stores copies of the data on N nodes. The client can choose a consistency level according to how important the data is; for example, QUORUM means the write succeeds only if a majority of those N nodes return success. What happens if one of those nodes is down? How does the write get delivered to it later?

Cassandra uses a technique called hinted handoff to solve this: the data is written to another, randomly chosen node X, together with a hint that it actually belongs on node Y, to be replayed when Y comes back online (remember, when node Y comes back, the gossip mechanism quickly informs X). Hinted handoff ensures that node Y catches up quickly with the rest of the cluster. Note that if hinted handoff does not work for some reason, read repair will still eventually “repair” the stale data, but only when a client reads it. Hinted writes are not readable (because node X is not one of the N official replicas for the data), so they do not count toward write consistency: if Cassandra is configured with 3 replicas and two of them are unavailable, a QUORUM write is impossible.
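
A minimal sketch of the hinted-handoff idea (toy dictionaries and hypothetical node names, not Cassandra code): a write destined for a down replica is parked on another node with a hint, and replayed when the target comes back.

```python
from collections import defaultdict

live = {"X": True, "Y": False}            # replica Y is currently down
data = defaultdict(dict)                  # node -> data it officially replicates
hints = defaultdict(list)                 # node holding hints -> [(target, key, value)]

def write(key, value, replica, hint_holder="X"):
    if live[replica]:
        data[replica][key] = value        # normal replica write
    else:
        # Park the write on another node, tagged with the intended target replica.
        hints[hint_holder].append((replica, key, value))

def node_recovers(node):
    live[node] = True                     # gossip announces the node is back
    for pending in hints.values():        # holders replay hints destined for this node
        for hint in [h for h in pending if h[0] == node]:
            _, key, value = hint
            data[node][key] = value
            pending.remove(hint)

write("k1", "v1", replica="Y")            # Y is down -> the write becomes a hint on X
print(dict(data), dict(hints))
node_recovers("Y")
print(dict(data), dict(hints))
```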

Page 122: Storage   cassandra

Cassandra – Anti-entropy

A well-known secret weapon of Cassandra is anti-entropy. Anti-entropy explicitly ensures that the nodes in the cluster agree on the current data: if, for whatever reason, neither read repair nor hinted handoff has taken effect, anti-entropy will still bring the nodes to eventual consistency. The anti-entropy service runs during “major compaction” (the equivalent of rebuilding a table in a relational database), so it is a relatively heavyweight but infrequently run process. Anti-entropy uses Merkle trees (also called hash trees) to determine where within a column family's data tree the nodes disagree, and then repairs each divergent branch.

Page 123: Storage   cassandra

Cassandra

Other

Page 124: Storage   cassandra

Other - Gossip

Page 125: Storage   cassandra

Other

Page 126: Storage   cassandra

Other

Page 127: Storage   cassandra

Other - DHT

DHTs(Distributed hash tables) : A DHT is a class of a decentralized distributed system that provides a lookup service similar to a hash table; (key, value) pairs are stored in a DHT, and any participating node can efficiently retrieve the value associated with a given key;

DHTs form an infrastructure that can be used to build more complex services, such as anycast, cooperative Web caching, distributed file systems, domain name services, instant messaging, multicast, and also peer-to-peer file sharing and content distribution systems.

http://en.wikipedia.org/wiki/Distributed_hash_table

Page 128: Storage   cassandra

Other - DHT

Page 129: Storage   cassandra

Other - Cassandra - Domain Models

Page 130: Storage   cassandra

Other -

Page 131: Storage   cassandra

Other -

Page 132: Storage   cassandra

Other - Bloom filter

An example of a Bloom filter, representing the set {x, y, z}. The colored arrows show the positions in the bit array that each set element is mapped to. The element w is not in the set {x, y, z}, because it hashes to one bit-array position containing 0. For this figure, m=18 and k=3. http://en.wikipedia.org/wiki/Bloom_filter

Page 133: Storage   cassandra

Other - Bloom filter

Page 134: Storage   cassandra

Other - Bloom filter

Bloom filters are used to speed up answers in a key-value storage system. Values are stored on a disk with slow access times, and Bloom filter decisions are much faster. However, some unnecessary disk accesses are made when the filter reports a positive (in order to weed out the false positives). Overall answer speed is better with the Bloom filter than without it. Using a Bloom filter for this purpose does, however, increase memory usage.
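
A toy Bloom filter matching the earlier figure's parameters (m = 18 bits, k = 3 hash functions; illustrative only, not Cassandra's implementation): false positives are possible, false negatives are not.

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: an m-bit array and k hash functions."""

    def __init__(self, m=18, k=3):
        self.m, self.k = m, k
        self.bits = [0] * m

    def _positions(self, key: str):
        for i in range(self.k):
            digest = hashlib.md5(f"{i}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, key: str):
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        return all(self.bits[pos] for pos in self._positions(key))

bf = BloomFilter()
for key in ("x", "y", "z"):
    bf.add(key)
print(bf.might_contain("x"))   # True: members are never reported absent
print(bf.might_contain("w"))   # usually False; a True here would be a false positive
```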

Page 135: Storage   cassandra

Other - Timestamps and Vector Clocks

Eventual consistency relies on deciding what value a row will eventually converge to;

In the case of two writers writing at “the same" time, this is difficult;

Timestamps are one solution, but rely on synchronized clocks and don't capture causality;

Vector clocks are an alternative method of capturing order in a distributed system.

Page 136: Storage   cassandra

Other - Vector Clocks

Definition: a vector clock is a tuple {T1, T2, …, TN} of clock values, one from each node.

V1 < V2 if:
• for all i, V1[i] <= V2[i]
• for at least one i, V1[i] < V2[i]

V1 < V2 implies a global time ordering of the two events.

When data is written from node i, it sets Ti to its clock value.

This allows eventual consistency to resolve conflicts between writes on multiple replicas.
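
A short sketch of the comparison rule above (plain tuples as vector clocks, hypothetical values): if neither clock happened before the other, the writes are concurrent and must be reconciled.

```python
def happened_before(v1, v2):
    """V1 < V2: every component of V1 <= V2 and at least one is strictly smaller."""
    return all(a <= b for a, b in zip(v1, v2)) and any(a < b for a, b in zip(v1, v2))

a = (1, 2, 0)   # clock values {T1, T2, T3} from three nodes
b = (1, 3, 0)   # node 2 wrote after seeing a
c = (2, 2, 0)   # node 1 wrote without seeing b

print(happened_before(a, b))                          # True  -> b supersedes a
print(happened_before(b, c), happened_before(c, b))   # False False -> concurrent, reconcile
```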

Page 137: Storage   cassandra

Other - CommitLog

Like relational database systems, Cassandra writes its log before writing data; the log is called the Commitlog. Unlike Memtables/SSTables, the Commitlog is per server, not per Column Family. Each Commitlog file has a fixed size and is called a Commitlog Segment; in the current version (0.5.1) this size is 128 MB, hard-coded in src\java\org\apache\cassandra\db\Commitlog.java. When a Commitlog file fills up, a new file is created, and old Commitlog files that are no longer needed are cleaned up automatically.

Page 138: Storage   cassandra

Other - CommitLog

Each Commitlog file (Segment) starts with a CommitlogHeader structure of fixed size (determined by the number of Column Families). It contains two important arrays, with one entry per Column Family. The first is a bitmap (BitSet dirty): the bit for a Column Family is set to 1 if its Memtable contains dirty data and 0 otherwise, which tells recovery which Column Families need to be recovered from the Commitlog. The second is an integer array (int[] lastFlushedAt) holding, for each Column Family, the log offset at the time of its last flush; recovery can start reading Commitlog records from that position. With these two structures, Cassandra can rebuild the contents of the in-memory Memtables from the persisted SSTables and the Commitlog after an unexpected restart — the equivalent of instance recovery in a relational database such as Oracle.

Page 139: Storage   cassandra

Other - CommitLog

When a Memtable is flushed to an SSTable on disk, the corresponding dirty bits are cleared in every Commitlog file; when the Commitlog reaches its size limit and a new file is created, the dirty array is inherited from the previous file. If all the bits in a Commitlog file's dirty array are zero, that file is no longer needed for recovery and can be deleted. Consequently, at recovery time every Commitlog file still present on disk is needed.

http://wiki.apache.org/cassandra/ArchitectureCommitLog

http://www.ningoo.net/html/2010/cassandra_commitlog.html

Page 140: Storage   cassandra

Cassandra

The End