Cassandra - lesson learned
Andrzej Ludwikowski
About me?
- http://aludwikowski.blogspot.com/
- https://github.com/aludwiko
- @aludwikowski
- SoftwareMill
Why Cassandra?
- BigData!!!
  - Volume (petabytes of data, trillions of entities)
  - Velocity (real-time, streams, millions of transactions per second)
  - Variety (un-, semi-, structured)
- Near-linear horizontal scaling (in proper use cases)
- Fully distributed, with no single point of failure
- Data replication, by default
What is Cassandra vs CAP?
- CAP Theorem - pick two
Origins?
2010
Name?
Write path
- Any node can coordinate any request (NSPOF)
- Replication Factor
- Consistency Level
[Diagram: a client (driver) sends a write to a ring of four nodes, Node 1 to Node 4; RF=3, CL=2]
Write path - consistent hashing
- Token ring from -2^63 to 2^63-1
- Partitioner: partition key -> token
[Diagram: simplified ring of four nodes, Node 1 to Node 4, owning token ranges 0-25, 25-50, 51-75 and 76-100; the partitioner maps the client's key to token 77, which falls in the 76-100 range and ends up stored on three nodes]
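The ring lookup can be sketched in a few lines of Python. This is only an illustration on the simplified 0-100 ring from the slides: the node names and range bounds come from the diagram, while `toy_partitioner` and the "owner plus next nodes clockwise" placement are assumptions made for the example, not Cassandra's real partitioner or replication strategy.

```python
from bisect import bisect_left

# Upper bound of the token range owned by each node, following the
# simplified 0-100 ring drawn on the slides (not real Cassandra tokens).
RING = [(25, "Node 1"), (50, "Node 2"), (75, "Node 3"), (100, "Node 4")]


def replicas_for(token, rf):
    """Return the rf nodes storing a token: the owner of the range,
    then the next nodes clockwise on the ring (assumed placement)."""
    bounds = [upper for upper, _ in RING]
    owner = bisect_left(bounds, token) % len(RING)
    return [RING[(owner + i) % len(RING)][1] for i in range(rf)]


def toy_partitioner(partition_key):
    """Hypothetical stand-in for the partitioner: maps a key onto 0-100."""
    return sum(partition_key.encode("utf-8")) % 101


# Token 77 falls into the 76-100 range, so Node 4 owns it.
print(replicas_for(77, rf=3))
```

With RF=3 the token is copied to the owner and the two following nodes, which matches the three copies of 77 shown in the diagram.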
DEMO
Write path - problems?
- Hinted handoff
- Retry idempotent inserts
  - built-in policies
- Lightweight transactions (Paxos)
- Batches
[Diagram: a client writes token 77 to the four-node ring (ranges 0-25, 25-50, 51-75, 76-100); three nodes hold a copy of 77]
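Why only idempotent inserts are safe to retry can be shown with a toy model. The sketch below assumes a single in-memory replica with last-write-wins semantics; the `insert` and `increment` helpers and the sample data are hypothetical and only mimic the behaviour.

```python
# Toy replica state: primary key -> (write timestamp, value).
table = {}
# Toy counter column, updated by a non-idempotent operation.
counter = {"views": 0}


def insert(key, value, timestamp):
    """Upsert with last-write-wins; applying it twice changes nothing."""
    current = table.get(key)
    if current is None or timestamp >= current[0]:
        table[key] = (timestamp, value)


def increment(name):
    """Not idempotent: every retry adds one more."""
    counter[name] += 1


# Simulate a timeout after which the client retries both operations.
for _attempt in range(2):
    insert("my doc", "some content", timestamp=1)
    increment("views")

print(table)    # the retried insert left a single, unchanged row
print(counter)  # the retried increment was applied twice
```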
Write path - node level
Write path - why so fast?
- Commit log - append only
- Periodic (10s) or batch sync to disk
- Network topology aware
50,000 t/s = 50 t/ms = 5 t/100us = 1 t/20us
[Diagram: a client writes to four nodes placed in Rack 1 and Rack 2; RF=2, CL=2]
[Diagram: a client writes to clusters in Asia DC and Europe DC]
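The throughput figures on the slide are the same number expressed in smaller time units. A quick check, with a helper name chosen only for this example:

```python
def microseconds_per_transaction(transactions_per_second):
    """Time budget for a single transaction, in microseconds."""
    return 1_000_000 / transactions_per_second


# 50,000 transactions per second leave 20 microseconds per transaction.
print(microseconds_per_transaction(50_000))
```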
Read path
- Most recent wins
- Eager retries
- In-memory
  - MemTable
  - Row Cache
  - Bloom Filters
  - Key Caches
  - Partition Summaries
- On disk
  - Partition Indexes
  - SSTables
[Diagram: a client reads from the four-node ring with RF=3, CL=3; the three replicas answer with timestamps 67, 99 and 88]
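The "most recent wins" rule from the read path can be sketched as follows. The timestamps 67, 99 and 88 are taken from the diagram; which node returned which timestamp and the values themselves are made up for the example.

```python
def resolve(responses):
    """Pick the replica response with the highest write timestamp."""
    return max(responses, key=lambda response: response["timestamp"])


# Hypothetical replica answers for one read with CL=3.
responses = [
    {"node": "Node 1", "timestamp": 67, "value": "old"},
    {"node": "Node 2", "timestamp": 99, "value": "newest"},
    {"node": "Node 3", "timestamp": 88, "value": "newer"},
]

winner = resolve(responses)
print(winner["value"])
```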
Immediate vs. Eventual Consistency
- if (writeCL + readCL) > replication_factor then immediate consistency
  - writeCL=ALL, readCL=1
  - writeCL=1, readCL=ALL
  - writeCL,readCL=QUORUM
- If "stale" is measured in milliseconds, how much are those milliseconds worth?
[Diagram: a client and the four-node ring, RF=3]
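The rule above is simple enough to check in code. A minimal sketch for RF=3, where consistency levels are expressed as the number of replicas that must answer; the function names are chosen for this example only.

```python
def quorum(replication_factor):
    """Number of replicas in a majority."""
    return replication_factor // 2 + 1


def is_immediately_consistent(write_cl, read_cl, replication_factor):
    """True when the write and read replica sets must overlap."""
    return write_cl + read_cl > replication_factor


RF = 3
ALL, ONE, QUORUM = RF, 1, quorum(RF)

print(is_immediately_consistent(ALL, ONE, RF))        # writeCL=ALL, readCL=1
print(is_immediately_consistent(ONE, ALL, RF))        # writeCL=1, readCL=ALL
print(is_immediately_consistent(QUORUM, QUORUM, RF))  # both QUORUM
print(is_immediately_consistent(ONE, ONE, RF))        # eventual consistency only
```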
Modeling - new mindset
- QDD, Query Driven Development
- Nesting is ok
- Duplication is ok
- Writes are cheap
QDD - Conceptual model
- Technology independent
- Chen notation
QDD - Application workflow
QDD - Logical model
- Chebotko diagram
QDD - Physical model
- Technology dependent
- Analysis and validation (finding problems)
- Physical optimization (fixing problems)
- Data types
Physical storage
- Primary key
- Partition key
CREATE TABLE videos (
    id int,
    title text,
    runtime int,
    year int,
    PRIMARY KEY (id)
);
 id | title               | runtime | year
----+---------------------+---------+------
  1 | dzien swira         |      93 | 2002
  2 | chlopaki nie placza |      96 | 2000
  3 | psy                 |     104 | 1992
  4 | psy 2               |      96 | 1994
Storage layout, one partition per id:
1 -> title: dzien swira, runtime: 93, year: 2002
2 -> title: chlopaki..., runtime: 96, year: 2000
3 -> title: psy, runtime: 104, year: 1992
4 -> title: psy 2, runtime: 96, year: 1994
-- title is not part of the primary key, so this query is rejected
-- unless ALLOW FILTERING or a secondary index is used
SELECT * FROM videos WHERE title = 'dzien swira';
Physical storage
- Primary key (could be compound)
- Partition key
- Clustering column (order, uniqueness)
CREATE TABLE videos_with_clustering (
    title text,
    runtime int,
    year int,
    PRIMARY KEY ((title), year)
);
 title    | year | runtime
----------+------+---------
 godzilla | 1954 |      98
 godzilla | 1998 |     140
 godzilla | 2014 |     123
 psy      | 1992 |     104
Storage layout, one partition per title, rows ordered by year:
godzilla -> 1954: runtime 98 | 1998: runtime 140 | 2014: runtime 123
psy -> 1992: runtime 104
SELECT * FROM videos_with_clustering WHERE title = 'godzilla';
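One way to picture this layout is a map of partitions, each holding rows sorted by the clustering column. The sketch below uses the rows from the table above; the nested-dict representation is a mental model chosen for the example, not the actual SSTable format.

```python
from collections import defaultdict

# (title, year, runtime) rows from the slide, inserted in arbitrary order.
rows = [
    ("godzilla", 1998, 140),
    ("psy", 1992, 104),
    ("godzilla", 1954, 98),
    ("godzilla", 2014, 123),
]

# partition key (title) -> clustering column (year) -> remaining columns
partitions = defaultdict(dict)
for title, year, runtime in rows:
    partitions[title][year] = {"runtime": runtime}


def select_by_title(title):
    """Read one partition; rows come back ordered by the clustering column."""
    partition = partitions.get(title, {})
    return [(title, year, partition[year]["runtime"]) for year in sorted(partition)]


print(select_by_title("godzilla"))
```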
Physical storage
- Primary key (could be compound)
- Partition key (could be composite)
- Clustering column (order, uniqueness)
CREATE TABLE videos_with_composite_pk (
    title text,
    runtime int,
    year int,
    PRIMARY KEY ((title, year))
);
 title    | year | runtime
----------+------+---------
 godzilla | 1954 |      98
 godzilla | 1998 |     140
 godzilla | 2014 |     123
 psy      | 1992 |     104
Storage layout, one partition per (title, year):
godzilla:1954 -> runtime 98
godzilla:1998 -> runtime 140
godzilla:2014 -> runtime 123
psy:1992 -> runtime 104
SELECT * FROM videos_with_composite_pk WHERE title = 'godzilla' AND year = 1954;
Modeling - clustering column(s)
Q: Retrieve videos an actor has appeared in (newest first).
CREATE TABLE videos_by_actor (
    actor text,
    added_date timestamp,
    video_id timeuuid,
    character_name text,
    description text,
    encoding frozen<video_encoding>,
    tags set<text>,
    title text,
    user_id uuid,
    PRIMARY KEY ( )
) WITH CLUSTERING ORDER BY ( );
Filling in the primary key, step by step:
-- partition by actor, newest first
PRIMARY KEY ((actor), added_date)) WITH CLUSTERING ORDER BY (added_date DESC);
-- video_id keeps two videos added at the same time distinct
PRIMARY KEY ((actor), added_date, video_id)) WITH CLUSTERING ORDER BY (added_date DESC);
-- character_name keeps several roles in the same video distinct
PRIMARY KEY ((actor), added_date, video_id, character_name)) WITH CLUSTERING ORDER BY (added_date DESC);
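Each extra clustering column in the primary key matters because rows with the same primary key overwrite each other. A small sketch of that effect, using invented sample rows (the actor, dates, ids and character names are placeholders):

```python
def store(rows, key_columns):
    """Keep one row per primary key; a repeated key is an upsert."""
    table = {}
    for row in rows:
        key = tuple(row[column] for column in key_columns)
        table[key] = row
    return table


# Hypothetical data: two videos added at the same moment, and two
# characters played in the second video.
rows = [
    {"actor": "A", "added_date": "2015-01-01", "video_id": 1, "character_name": "X"},
    {"actor": "A", "added_date": "2015-01-01", "video_id": 2, "character_name": "Y"},
    {"actor": "A", "added_date": "2015-01-01", "video_id": 2, "character_name": "Z"},
]

print(len(store(rows, ["actor", "added_date"])))
print(len(store(rows, ["actor", "added_date", "video_id"])))
print(len(store(rows, ["actor", "added_date", "video_id", "character_name"])))
```

Only the last primary key keeps all three rows; the shorter keys silently lose data.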
Modeling - compound partition key
Q: Retrieve the last 1000 measurements from a given day.
CREATE TABLE temperature_by_day (
    weather_station_id text,
    date text,
    event_time timestamp,
    temperature text,
    PRIMARY KEY ( )
) WITH CLUSTERING ORDER BY ( );
First attempt, one partition per weather station:
    PRIMARY KEY ((weather_station_id), date, event_time)
) WITH CLUSTERING ORDER BY (date DESC, event_time DESC);
The partition keeps growing (at one measurement per second):
- 1 day = 86 400 rows
- 1 week = 604 800 rows
- 1 month = 2 592 000 rows
- 1 year = 31 536 000 rows
Second attempt, one partition per weather station and day:
    PRIMARY KEY ((weather_station_id, date), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
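The row counts on the slide follow from one measurement per second, which is an assumption read off the 86 400 rows per day figure. A short calculation, with a 30-day month and a 365-day year as on the slide:

```python
SECONDS_PER_DAY = 24 * 60 * 60


def rows_per_partition(days, measurements_per_second=1):
    """Rows collected by one weather station over the given number of days."""
    return days * SECONDS_PER_DAY * measurements_per_second


for label, days in [("day", 1), ("week", 7), ("month", 30), ("year", 365)]:
    print(label, rows_per_partition(days))
```

Partitioning by station and date caps a partition at the one-day figure, while partitioning by station alone lets it grow without bound.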
Modeling - TTL
Retention policy - keep data only from the last week.
CREATE TABLE temperature_by_day (
    weather_station_id text,
    date text,
    event_time timestamp,
    temperature text,
    PRIMARY KEY ((weather_station_id, date), event_time)
) WITH CLUSTERING ORDER BY (event_time DESC);
INSERT INTO temperature_by_day … USING TTL 604800;
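The TTL is given in seconds, so 604800 is exactly one week. A tiny helper (its name is made up for this example) makes the conversion explicit:

```python
from datetime import timedelta


def ttl_seconds(retention):
    """Convert a retention period into a TTL value in seconds."""
    return int(retention.total_seconds())


print(ttl_seconds(timedelta(weeks=1)))
```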
Modeling - bitmap index
Q: Find car by year and/or model and/or color.
CREATE TABLE car (
    year text,
    model text,
    color text,
    vehicle_id int,
    -- other columns
    PRIMARY KEY ((year, model, color), vehicle_id)
);
INSERT INTO car (year, model, color, vehicle_id, ...) VALUES ('2000', 'Multipla', 'blue', 13, ...);
INSERT INTO car (year, model, color, vehicle_id, ...) VALUES ('2000', 'Multipla', '', 13, ...);
INSERT INTO car (year, model, color, vehicle_id, ...) VALUES ('2000', '', 'blue', 13, ...);
INSERT INTO car (year, model, color, vehicle_id, ...) VALUES ('2000', '', '', 13, ...);
INSERT INTO car (year, model, color, vehicle_id, ...) VALUES ('', 'Multipla', 'blue', 13, ...);
INSERT INTO car (year, model, color, vehicle_id, ...) VALUES ('', 'Multipla', '', 13, ...);
INSERT INTO car (year, model, color, vehicle_id, ...) VALUES ('', '', 'blue', 13, ...);
SELECT * FROM car WHERE year = '2000' AND model = '' AND color = 'blue';
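The seven inserts are every combination of "real value or empty placeholder" for the three attributes, minus the all-empty one. A sketch that generates them; the function name and the use of an empty string as placeholder mirror the slide, the rest is an assumption for illustration.

```python
from itertools import product


def index_rows(attributes, placeholder=""):
    """Return every partition key to write for one car: each attribute is
    either its real value or the placeholder, skipping the all-empty key."""
    names = list(attributes)
    rows = []
    for mask in product([True, False], repeat=len(names)):
        if not any(mask):
            continue  # a key with no real value is never queried
        rows.append(tuple(
            attributes[name] if keep else placeholder
            for name, keep in zip(names, mask)
        ))
    return rows


keys = index_rows({"year": "2000", "model": "Multipla", "color": "blue"})
for key in keys:
    print(key)
```

With n attributes this means 2^n - 1 writes per car, which is the price paid for answering every and/or query with a single partition read.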
Modeling - wide rows
Q: Find user by email.
CREATE TABLE user (
    email text,
    name text,
    age int,
    PRIMARY KEY (email)
);
CREATE TABLE user (
    domain text,
    user text,
    name text,
    age int,
    PRIMARY KEY ((domain), user)
);
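With the second table the application has to split the email into the partition key (domain) and the clustering column (user) before querying. A minimal sketch; the helper name and the sample address are invented for the example.

```python
def split_email(email):
    """Split an email into (domain, user), the two key columns."""
    user, _, domain = email.rpartition("@")
    return domain, user


domain, user = split_email("andrzej@example.com")
print(domain, user)
```

All users of one domain then share a single partition, which is what makes the row wide.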
Modeling - versioning with lightweight transactions
CREATE TABLE document (
    id text,
    content text,
    version int,
    locked_by text,
    PRIMARY KEY ((id))
);
INSERT INTO document (id, content, version)
    VALUES ('my doc', 'some content', 1)
    IF NOT EXISTS;
UPDATE document SET locked_by = 'andrzej'
    WHERE id = 'my doc'
    IF locked_by = null;
UPDATE document SET content = 'better content', version = 2, locked_by = null
    WHERE id = 'my doc'
    IF locked_by = 'andrzej';
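The lock, edit, unlock flow can be mimicked with a compare-and-set on a plain dictionary. This is a single-process sketch of the conditional semantics only: there is no Paxos here, the class and method names are hypothetical, and the row is assumed to exist before it is updated.

```python
class DocumentTable:
    """In-memory stand-in for the document table."""

    def __init__(self):
        self.rows = {}

    def insert_if_not_exists(self, doc_id, content, version):
        """Mimics INSERT ... IF NOT EXISTS; returns whether it was applied."""
        if doc_id in self.rows:
            return False
        self.rows[doc_id] = {"content": content, "version": version, "locked_by": None}
        return True

    def update_if_locked_by(self, doc_id, expected, **changes):
        """Mimics UPDATE ... IF locked_by = expected on an existing row."""
        row = self.rows.get(doc_id)
        if row is None or row["locked_by"] != expected:
            return False
        row.update(changes)
        return True


docs = DocumentTable()
created = docs.insert_if_not_exists("my doc", "some content", 1)
locked = docs.update_if_locked_by("my doc", None, locked_by="andrzej")
# A second writer loses the race because the lock is already taken.
stolen = docs.update_if_locked_by("my doc", None, locked_by="someone else")
saved = docs.update_if_locked_by(
    "my doc", "andrzej", content="better content", version=2, locked_by=None
)
print(created, locked, stolen, saved)
```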
Modeling - JSON with UDT and tuples
{
    "title": "Example Schema",
    "type": "object",
    "properties": {
        "firstName": "andrzej",
        "lastName": "ludwikowski",
        "age": {
            "description": "Age in years",
            "type": "integer",
            "minimum": 0
        }
    },
    "x_dimension": "1",
    "y_dimension": "2"
}
CREATE TYPE age (
    description text,
    type text,
    minimum int
);
CREATE TYPE prop (
    firstName text,
    lastName text,
    age frozen<age>
);
CREATE TABLE json (
    title text,
    type text,
    properties list<frozen<prop>>,
    dimensions tuple<int, int>,
    PRIMARY KEY (title)
);
Common use cases
- Sensor data (Zonar)
- Fraud detection (Barracuda)
- Playlist and collections (Spotify)
- Personalization and recommendation engines (eBay)
- Messaging (Instagram)
Common anti use cases
- Queue
- Search engine