Page 1: Cooking Cassandra

Scalable eCommerce Platform Solutions

Apache Cassandra

High-level overview and lessons learned

Page 2: Cooking Cassandra

Apache Cassandra

2/14/14

Page 3: Cooking Cassandra

Highlights

• Distributed column family database
• No single point of failure (SPOF)
• Decentralized
• Data is both partitioned and replicated
• Optimized for high write throughput
• Availability vs. consistency (A vs. C in CAP) is tunable at query time
• SEDA (staged event-driven architecture)

Page 4: Cooking Cassandra

Partitioning (Consistent Hashing)

Page 5: Cooking Cassandra

Replication (RF 3)
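A minimal sketch of consistent hashing with RF 3, assuming one token per node, MD5 as the hash, and replicas placed on the next nodes clockwise around the ring (no rack or data center awareness). The names `token`, `Ring` and `replicas` are made up for this illustration and are not Cassandra APIs.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Map a key onto the ring (MD5 is an illustrative choice)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16)


class Ring:
    def __init__(self, nodes):
        # Assumption: each node gets a single token derived from its name.
        self.entries = sorted((token(name), name) for name in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas(self, key: str, rf: int = 3):
        """Return rf distinct nodes, walking clockwise from the key's token."""
        size = len(self.entries)
        # First node whose token is >= the key's token; wrap around at the end.
        start = bisect.bisect_left(self.tokens, token(key)) % size
        return [self.entries[(start + i) % size][1] for i in range(min(rf, size))]


ring = Ring(["node1", "node2", "node3", "node4", "node5"])
print(ring.replicas("product:123"))
```

The first node in the returned list owns the key's token range; the other two hold the additional copies. Adding a node to such a ring only takes keys away from the node that follows the new token, so most keys stay where they are.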

Page 6: Cooking Cassandra

Adding New Node

Page 7: Cooking Cassandra

Partitioning (MOD N)

 Node 1 | Node 2 | Node 3
--------+--------+--------
      0 |      1 |      2
      3 |      4 |      5
      6 |      7 |      8

 Node 1 | Node 2 | Node 3 | Node 4
--------+--------+--------+--------
      0 |      1 |      2 |      3
      4 |      5 |      6 |      7
      8 |        |        |
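The effect of growing a MOD N cluster from 3 to 4 nodes can be checked with a few lines. This is only a sketch for the nine keys on the slide; `owner_mod_n` and `moved_keys` are illustrative names.

```python
def owner_mod_n(key: int, node_count: int) -> int:
    """Return the 1-based node number that owns the key under MOD N placement."""
    return key % node_count + 1


def moved_keys(keys, old_count: int, new_count: int):
    """List the keys whose owner changes when the node count changes."""
    return [k for k in keys
            if owner_mod_n(k, old_count) != owner_mod_n(k, new_count)]


print(moved_keys(range(9), 3, 4))  # [3, 4, 5, 6, 7, 8]
```

Six of the nine keys change owner, which is the reason MOD N placement is contrasted with consistent hashing here.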

Page 8: Cooking Cassandra

Virtual Nodes

• Going from one token and range per node to many tokens per node
• No manual assignments of tokens to nodes
• Load is evenly distributed when a node joins and leaves cluster
• Improves the use of heterogeneous machines in a cluster

Page 9: Cooking Cassandra

Key Data Distribution Components

• Partitioner calculates a token from the row key (determines where to place the first replica of a row)
• Replication Strategy determines the total number of replicas and where to place them
• Snitch defines the network topology, such as the location of nodes, grouping them by racks and data centers. Used by:
  • Replication Strategy
  • Routing Requests (+ Dynamic Snitch)

Page 10: Cooking Cassandra

Write Requests

• A coordinator node sends a write request to all replicas regardless of Consistency Level (CL)
• It acknowledges the request to the client once the CL is satisfied
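A small sketch of the acknowledgement rule, assuming a single data center and only the levels ONE, TWO, THREE, QUORUM and ALL. The function names and the list-of-booleans model of replica health are invented for the example.

```python
def required_acks(consistency_level: str, replication_factor: int) -> int:
    """Number of replica acknowledgements needed before the client is answered."""
    levels = {
        "ONE": 1,
        "TWO": 2,
        "THREE": 3,
        "QUORUM": replication_factor // 2 + 1,
        "ALL": replication_factor,
    }
    return levels[consistency_level]


def coordinate_write(replica_alive, consistency_level: str):
    """Send the write to every replica; succeed if enough of them acknowledge.

    replica_alive: one boolean per replica (True means it acknowledges).
    Returns (requests_sent, cl_satisfied).
    """
    rf = len(replica_alive)
    acks = sum(1 for alive in replica_alive if alive)
    return rf, acks >= required_acks(consistency_level, rf)


print(coordinate_write([True, True, False], "QUORUM"))  # (3, True)
```

The request count is always RF; the CL only changes how many acknowledgements the coordinator waits for.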

Page 11: Cooking Cassandra

Read Requests - Optimistic Flow

• A coordinator node sends direct read requests to CL number of fastest replicas (Dynamic Snitch)
  • 1 request for full read
  • CL - 1 requests for digest reads
• If there is a match it is returned to client
• Background read repair requests are sent to other owners of that row based on read repair chance

Page 12: Cooking Cassandra

Read Requests - Mismatch Case

• If there is a mismatch, the coordinator node sends direct full read requests to the same CL number of replicas
• The most recent copy is returned to the client
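The two read flows can be sketched together. This is a simplified model, assuming each replica holds a single `(value, timestamp)` cell and that a digest is an MD5 over that cell; `digest` and `coordinator_read` are hypothetical names, and read repair is left out.

```python
import hashlib


def digest(cell) -> str:
    """Hash of a (value, timestamp) cell, standing in for a digest read."""
    value, timestamp = cell
    return hashlib.md5(f"{value}:{timestamp}".encode()).hexdigest()


def coordinator_read(replica_cells):
    """Resolve a read over the CL fastest replicas.

    replica_cells: list of (value, timestamp), one entry per contacted replica.
    Returns (value, mismatch_detected).
    """
    full = replica_cells[0]                            # one full read
    digests = [digest(c) for c in replica_cells[1:]]   # CL - 1 digest reads
    if all(d == digest(full) for d in digests):
        return full[0], False                          # optimistic flow: match
    # Mismatch case: full reads from the same replicas, newest timestamp wins.
    newest = max(replica_cells, key=lambda cell: cell[1])
    return newest[0], True


print(coordinator_read([("old", 1), ("new", 2)]))  # ('new', True)
```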

Page 13: Cooking Cassandra

Write Path

• Flush to disk happens when the memtable size threshold, commit log size threshold, or heap utilization threshold is reached
• Never random disk I/O or modification in place
• Compaction runs in the background
• A delete just marks a column with a tombstone
• Commit log contains all mutations
• Memtable keeps track of the latest version of data

Page 14: Cooking Cassandra

Read Path

• Each SSTable is read, results are combined with unflushed memtable(s), latest version returned
• KeyCache is fixed size and shared among all tables
• are stored off heap (v1.2.X)
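A toy model of the write and read paths, kept entirely in memory. It assumes a flush is triggered by memtable entry count only and ignores compaction, caches and on-disk formats; the `Table` class and its methods are invented for this sketch.

```python
TOMBSTONE = object()  # marker written by deletes


class Table:
    def __init__(self, memtable_limit: int = 2):
        self.commit_log = []   # append-only record of every mutation
        self.memtable = {}     # key -> (timestamp, value), latest version only
        self.sstables = []     # flushed snapshots, never modified afterwards
        self.memtable_limit = memtable_limit

    def write(self, key, value, timestamp):
        self.commit_log.append((key, value, timestamp))
        current = self.memtable.get(key)
        if current is None or timestamp >= current[0]:
            self.memtable[key] = (timestamp, value)
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def delete(self, key, timestamp):
        # A delete is just another write that stores a tombstone.
        self.write(key, TOMBSTONE, timestamp)

    def flush(self):
        self.sstables.append(dict(self.memtable))
        self.memtable = {}

    def read(self, key):
        # Collect every version from the SSTables and the unflushed memtable.
        versions = [t[key] for t in self.sstables if key in t]
        if key in self.memtable:
            versions.append(self.memtable[key])
        if not versions:
            return None
        timestamp, value = max(versions, key=lambda v: v[0])
        return None if value is TOMBSTONE else value
```

Writes only append to the log and update the memtable, so nothing on "disk" is modified in place; reads pay for that by merging several sources and picking the newest version.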

Page 15: Cooking Cassandra

ACID

• Atomicity
  • a write is atomic at the row level
  • doesn’t roll back if a write fails on some replicas
• Consistency
  • tunable through CL requirements (C vs A)
  • Strong Consistency: W + R > N
• Isolation
  • row-level
• Durability
  • yes, but
  • commit log fsync every 10 seconds by default
• Lightweight transactions in Cassandra 2.0
  • For INSERT, UPDATE statements
  • using IF clause
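The W + R > N rule is easy to check numerically. In this sketch W and R are the numbers of replicas that must answer a write and a read, and N is the replication factor; the function names are illustrative.

```python
def quorum(replication_factor: int) -> int:
    """Smallest majority of the replicas."""
    return replication_factor // 2 + 1


def is_strongly_consistent(w: int, r: int, n: int) -> bool:
    """True when every read set overlaps every write set (W + R > N)."""
    return w + r > n


n = 3
print(is_strongly_consistent(quorum(n), quorum(n), n))  # True  (QUORUM + QUORUM)
print(is_strongly_consistent(1, 1, n))                  # False (ONE + ONE)
print(is_strongly_consistent(n, 1, n))                  # True  (ALL + ONE)
```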

Page 16: Cooking Cassandra

Built-in Repair Tools

• Hinted handoff
  • does not count towards CL requirement
  • if CL.ANY is used, not readable until at least one normal owner is recovered
• Read repair
• Anti-entropy node repair

Page 17: Cooking Cassandra

Data Modeling

Page 18: Cooking Cassandra

Data Modeling

• Read by partition key
• Reduce number of reads
  • aggregate data used together in a single row
  • even at expense of number of writes to duplicate some data
• Writes should not depend on reads
• Keep metadata overhead low

Page 19: Cooking Cassandra

CQL3 Overview

• It looks like SQL
• Compound keys
• Standard data types are built-in
• Collection type
• Asynchronous queries
• Tracing of queries
• … and more

Page 20: Cooking Cassandra

Simple Row / CQL3

CREATE TABLE simple_table (
  my_key int PRIMARY KEY,
  my_field_1 text,
  my_field_2 boolean
);

INSERT INTO simple_table (my_key, my_field_1, my_field_2) VALUES (1, 'my value 1', false);
INSERT INTO simple_table (my_key, my_field_1, my_field_2) VALUES (2, 'my value 2', true);

SELECT * FROM simple_table;

 my_key | my_field_1 | my_field_2
--------+------------+------------
      1 | my value 1 |      False
      2 | my value 2 |       True

Page 21: Cooking Cassandra

Simple Row / Internal

[default@test] list simple_table;
-------------------
RowKey: 1
=> (name=, value=, timestamp=1395180822477000)
=> (name=my_field_1, value=6d792076616c75652031, timestamp=1395180822477000)
=> (name=my_field_2, value=00, timestamp=1395180822477000)
-------------------
RowKey: 2
=> (name=, value=, timestamp=1395180822480000)
=> (name=my_field_1, value=6d792076616c75652032, timestamp=1395180822480000)
=> (name=my_field_2, value=01, timestamp=1395180822480000)

1. Column name (size is proportional to column name length) and timestamp are stored for each column
2. There is an additional “empty” column per row

Page 22: Cooking Cassandra

Compound Key / CQL3

CREATE TABLE compound_key_table (
  my_part_key int,
  my_clust_key text,
  my_field int,
  PRIMARY KEY (my_part_key, my_clust_key)
);

INSERT INTO compound_key_table (my_part_key, my_clust_key, my_field) VALUES (1, 'my value 2', 2);
INSERT INTO compound_key_table (my_part_key, my_clust_key, my_field) VALUES (1, 'my value 1', 1);
INSERT INTO compound_key_table (my_part_key, my_clust_key, my_field) VALUES (1, 'my value 3', 3);

SELECT * FROM compound_key_table;

 my_part_key | my_clust_key | my_field
-------------+--------------+----------
           1 |   my value 1 |        1
           1 |   my value 2 |        2
           1 |   my value 3 |        3

Page 23: Cooking Cassandra

Compound Key / Internal

[default@test] list compound_key_table;
-------------------
RowKey: 1
=> (name=my value 1:, value=, timestamp=1395192704575000)
=> (name=my value 1:my_field, value=00000001, timestamp=1395192704575000)
=> (name=my value 2:, value=, timestamp=1395192704572000)
=> (name=my value 2:my_field, value=00000002, timestamp=1395192704572000)
=> (name=my value 3:, value=, timestamp=1395192704577000)
=> (name=my value 3:my_field, value=00000003, timestamp=1395192704577000)

1. All three CQL3 rows are in the same physical row, thus a single read operation can read all of them
2. Still can read or update them partially (need to know PK - use lookup table)
3. The value of the ‘my_clust_key’ column is joined with the ‘my_field’ column name and becomes the name of the internal column that holds my_field’s value
4. The ‘my_clust_key’ value doesn’t have an associated timestamp, since it is part of the PK
5. The CQL3 rows are sorted by the value of ‘my_clust_key’, which can be used in a ‘where’ clause
6. There is an additional “empty” column per CQL3 row
7. PK column names are hidden in system.schema_columnfamilies

Page 24: Cooking Cassandra

Collection Type / CQL3

CREATE TABLE collection_type_table (
  my_key int PRIMARY KEY,
  my_set set<int>,
  my_map map<int, int>,
  my_list list<int>
);

INSERT INTO collection_type_table (my_key, my_set, my_map, my_list)
  VALUES (1, {1, 2}, {1:2, 3:4}, [1, 2]);

SELECT * FROM collection_type_table;

 my_key | my_list | my_map       | my_set
--------+---------+--------------+--------
      1 |  [1, 2] | {1: 2, 3: 4} | {1, 2}

Page 25: Cooking Cassandra

Collection Type / Internal

[default@test] list collection_type_table;
-------------------
RowKey: 1
=> (name=, value=, timestamp=1395253516706000)
=> (name=my_list:d1da8820af9311e38f4e97aee9b28d0c, value=00000001, timestamp=1395253516706000)
=> (name=my_list:d1da8821af9311e38f4e97aee9b28d0c, value=00000002, timestamp=1395253516706000)
=> (name=my_map:00000001, value=00000002, timestamp=1395253516706000)
=> (name=my_map:00000003, value=00000004, timestamp=1395253516706000)
=> (name=my_set:00000001, value=, timestamp=1395253516706000)
=> (name=my_set:00000002, value=, timestamp=1395253516706000)

1. Each element of each collection gets its own column
2. Each element of List type additionally consumes 16 bytes to maintain order of elements
3. Map key goes to column name
4. Set value goes to column name

Page 26: Cooking Cassandra

Column Overhead

• name : 2 bytes (length as short int) + byte[]
• flags : 1 byte
• if counter column : 8 bytes (timestamp of last delete)
• if expiring column : 4 bytes (TTL) + 4 bytes (local deletion time)
• timestamp : 8 bytes (long)
• value : 4 bytes (len as int) + byte[]

http://btoddb-cass-storage.blogspot.ru/2011/07/column-overhead-and-sizing-every-column.html

Page 27: Cooking Cassandra

Metadata Overhead

• Simple case (no TTL and not a Counter column):
  • regular_column_size = column_name_size + column_value_size + 15 bytes
  • row has 23 bytes of overhead
• A column with name “my_column” of type int stores your 4 bytes and incurs 24 bytes of overhead
• Keep this in mind when internal columns are created for CQL3 structures like Compound Keys or Collection Types
• Keep this in mind when a column value is used as the column name for many other columns
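The 24-byte figure follows from the formula: 15 fixed bytes (2 for the name length, 1 for flags, 8 for the timestamp, 4 for the value length) plus the 9 bytes of the name “my_column”. A short check, with illustrative function names and the simple case only (no TTL, not a counter):

```python
# Fixed per-column bytes: name length (2) + flags (1) + timestamp (8) + value length (4).
COLUMN_FIXED_OVERHEAD = 2 + 1 + 8 + 4


def column_overhead(name: str) -> int:
    """Bytes stored for a regular column in addition to the value itself."""
    return len(name.encode("utf-8")) + COLUMN_FIXED_OVERHEAD


def regular_column_size(name: str, value_size: int) -> int:
    """column_name_size + column_value_size + 15 bytes."""
    return column_overhead(name) + value_size


print(column_overhead("my_column"))         # 24
print(regular_column_size("my_column", 4))  # 28
```

For a 4-byte int the overhead is six times the payload; for a 1024-byte value under the same name it is about 2 % of the payload.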

Page 28: Cooking Cassandra

JSON vs Separate Columns

• Drastically reduces metadata overhead
  • A column with name “my_column” of type text which stores your 1 kB JSON object and incurs 24 bytes of overhead sounds much better!
• Saves CPU cycles and reduces read latency
• Supports complex hierarchical structures
• But it loses in partial reads / updates and complicates schema versioning

Page 29: Cooking Cassandra

Use Case 1: Products and Upcs

CREATE TABLE product (
  pid int,
  upc int,
  value text,
  rstat text,
  PRIMARY KEY (pid, upc)
);

 pid | upc | rstat               | value
-----+-----+---------------------+---------------------
 123 |   0 | Reviews JSON Object | Product JSON Object
 123 | 456 |                null | Upc JSON Object
 123 | 789 |                null | Upc JSON Object

Page 30: Cooking Cassandra

Use Case 2: Availability

CREATE TABLE online_inventory (
  pid int,
  upc int,
  available boolean,
  PRIMARY KEY (pid, upc)
);

INSERT INTO online_inventory (pid, upc, available)
  VALUES (123, 456, true) USING TIMESTAMP 5;
INSERT INTO online_inventory (pid, upc, available)
  VALUES (123, 456, false) USING TIMESTAMP 4;

 pid | upc | available | writetime(available)
-----+-----+-----------+----------------------
 123 | 456 |      True |                    5
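The result above comes from timestamp-based reconciliation: the write carrying the higher timestamp wins even if it arrives first. A minimal model of one column, with a hypothetical `Cell` class and without handling equal timestamps:

```python
class Cell:
    """One column value with last-write-wins semantics based on timestamps."""

    def __init__(self):
        self.value = None
        self.timestamp = None

    def write(self, value, timestamp: int) -> None:
        # Keep the incoming value only if it is newer than what is stored.
        if self.timestamp is None or timestamp > self.timestamp:
            self.value = value
            self.timestamp = timestamp


available = Cell()
available.write(True, 5)    # USING TIMESTAMP 5
available.write(False, 4)   # USING TIMESTAMP 4, applied later but older
print(available.value, available.timestamp)  # True 5
```

This lets out-of-order availability updates be applied safely, provided the writers supply a meaningful timestamp.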

Page 31: Cooking Cassandra

Use Case 3: Product Pagination

CREATE TABLE product_pagination (
  filter text,
  pid int,
  PRIMARY KEY (filter, pid)
);

INSERT INTO product_pagination (filter, pid) VALUES ('ACTIVE', 45);
INSERT INTO product_pagination (filter, pid) VALUES ('ACTIVE', 25);
INSERT INTO product_pagination (filter, pid) VALUES ('ACTIVE', 75);
INSERT INTO product_pagination (filter, pid) VALUES ('ACTIVE', 15);

SELECT * FROM product_pagination WHERE filter = 'ACTIVE' AND pid > 15 LIMIT 2;

 filter | pid
--------+-----
 ACTIVE |  25
 ACTIVE |  45
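The paging pattern in the query (resume after the last pid seen, take LIMIT rows) can be modelled in memory. This sketch assumes a sorted list stands in for one partition ordered by its clustering key; `page` and `iterate_pages` are illustrative names.

```python
import bisect


def page(sorted_pids, after_pid: int, limit: int):
    """Rows with pid > after_pid, at most `limit` of them (pid > ? LIMIT ?)."""
    start = bisect.bisect_right(sorted_pids, after_pid)
    return sorted_pids[start:start + limit]


def iterate_pages(pids, limit: int):
    """Walk the whole partition page by page, using the last pid as a cursor."""
    ordered = sorted(pids)  # clustering order inside the partition
    chunk = ordered[:limit]
    while chunk:
        yield chunk
        chunk = page(ordered, chunk[-1], limit)


active = [45, 25, 75, 15]
print(page(sorted(active), 15, 2))     # [25, 45]
print(list(iterate_pages(active, 2)))  # [[15, 25], [45, 75]]
```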

Page 32: Cooking Cassandra

DataStax Java Driver

Page 33: Cooking Cassandra

DataStax Java Driver

• Flexible load balancing policies
  • includes token aware load balancing
• Connection pooling
• Flexible retry policy
  • can retry on other nodes
  • or reduce CL requirement
• Non-blocking I/O
  • up to 128 simultaneous requests per connection
  • asynchronous API
• Nodes discovery

Page 34: Cooking Cassandra

Multi-gets

• When you have N keys and want to read them all
• Built-in token-aware load balancer evaluates the first key and sends all N keys to that node! oops…
• We preferred sending N fine-grained single-get queries in async mode
  • retries only those which failed
  • can return partial result
  • smart route for each key
• We tried multi-get-aware token-aware load balancer
  • worked worse
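The preferred approach (N single-key reads issued concurrently, retrying only the failures and returning whatever succeeded) might look like the following. It is written with Python's asyncio rather than the Java driver, and `fetch_one` is a stand-in for a real single-key query against a dict; none of these names come from the talk.

```python
import asyncio


async def fetch_one(key, store):
    """Stand-in for one single-key query; raises if the key cannot be read."""
    await asyncio.sleep(0)  # yield control, as a network call would
    if key not in store:
        raise KeyError(key)
    return store[key]


async def multi_get(keys, store, retries: int = 1):
    """Issue one request per key concurrently and retry only the failed ones.

    Returns (results, failed_keys), so a partial result is still usable.
    """
    results = {}
    pending = list(keys)
    for _ in range(retries + 1):
        outcomes = await asyncio.gather(
            *(fetch_one(k, store) for k in pending), return_exceptions=True
        )
        failed = []
        for key, outcome in zip(pending, outcomes):
            if isinstance(outcome, Exception):
                failed.append(key)
            else:
                results[key] = outcome
        pending = failed
        if not pending:
            break
    return results, pending


results, failed = asyncio.run(multi_get([1, 2, 3], {1: "a", 3: "c"}))
print(results, failed)  # {1: 'a', 3: 'c'} [2]
```

Because every key is its own request, a token-aware policy can route each one to a replica that owns it, which is the "smart route for each key" point above.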

Page 35: Cooking Cassandra

Data Loader

Page 36: Cooking Cassandra

Data Loader

• partitions the whole data set (MOD N)

• sorts all result sets by product id

• accumulates assembled products and executes batch write to C*

• single connection per reader thread
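The slide gives only an outline of the loader, so the following is a loose sketch of the partition, sort and batch steps under assumed details: product ids are integers, each reader thread takes the ids where `pid % reader_count` matches its index, and a batch is a fixed-size slice. The function names are hypothetical.

```python
def partition(product_ids, reader_count: int):
    """Split ids across readers by MOD N and sort each reader's share."""
    shares = [[] for _ in range(reader_count)]
    for pid in product_ids:
        shares[pid % reader_count].append(pid)
    return [sorted(share) for share in shares]


def batches(items, batch_size: int):
    """Yield consecutive fixed-size groups, ready for one batch write each."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


shares = partition([5, 1, 4, 2, 3, 6], reader_count=2)
print(shares)                        # [[2, 4, 6], [1, 3, 5]]
print(list(batches(shares[1], 2)))   # [[1, 3], [5]]
```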

Page 37: Cooking Cassandra

Cassandra 1.2.X Known Issues

Page 38: Cooking Cassandra

OOM #1

• select count (*) from product limit 75000000;
• wait for timeout
• hmm, try again (arrow up, enter)
• select count (*) from product limit 75000000;
• wait for timeout
• again

Page 39: Cooking Cassandra

OOM #2

• Try the following in production and get permanent vacation
  • truncate, drop, create table
  • load data there
  • start light read load
• Up to all C* nodes can get OOM simultaneously
• That is called high availability!

Page 40: Cooking Cassandra

DROP/CREATE without TRUNCATE

• SSTable files are still on disk after DROP
• CREATE triggers reading of the files
• and C* fails…

