Learning Cassandra Dave Gardner @davegardnerisme

Upload: dave-gardner

Post on 15-Jan-2015

DESCRIPTION

Context to choosing NoSQL, learning Cassandra basics plus some basic data modelling patterns and anti-patterns.

TRANSCRIPT

Page 1: Learning Cassandra

Learning Cassandra

Dave Gardner @davegardnerisme

Page 2: Learning Cassandra

What I’m going to cover

• How to NoSQL
• Cassandra basics (dynamo and big table)
• How to use the data model in real life

Page 3: Learning Cassandra

How to NoSQL

1. Find data store that doesn’t use SQL
2. Anything
3. Cram all the things into it
4. Triumphantly blog this success
5. Complain a month later when it bursts into flames

http://www.slideshare.net/rbranson/how-do-i-cassandra/4

Page 4: Learning Cassandra

Choosing NoSQL

“NoSQL DBs trade off traditional features to better support new and emerging use cases”

http://www.slideshare.net/argv0/riak-use-cases-dissecting-the-solutions-to-hard-problems

Page 5: Learning Cassandra

Choosing Cassandra: Tradeoffs

More widely used, tested and documented software
MySQL first OS release 1998

For a relatively immature product
Cassandra first open-sourced in 2008

Page 6: Learning Cassandra

Choosing Cassandra: Tradeoffs

Ad-hoc querying
SQL join, group by, having, order

For a rich data model with limited ad-hoc querying ability
Cassandra makes you denormalise

Page 7: Learning Cassandra

Choosing NoSQL

“they say … I can’t decide between this project and this project even though they look nothing like each other. And the fact that you can’t decide indicates that you don’t actually have a problem that requires them.”

Benjamin Black – NoSQL Tapes (at 30:15)

http://nosqltapes.com/video/benjamin-black-on-nosql-cloud-computing-and-fast_ip

Page 8: Learning Cassandra

What do we get in return?

Proven horizontal scalability

Cassandra scales reads and writes linearly as new nodes are added

Page 10: Learning Cassandra

What do we get in return?

High availability

Cassandra is fault-resistant with tunable consistency levels

Page 11: Learning Cassandra

What do we get in return?

Low latency, solid performance

Cassandra has very good write performance

Page 12: Learning Cassandra

Performance benchmark *

http://blog.cubrid.org/dev-platform/nosql-benchmarking/

* Add pinch of salt

Page 13: Learning Cassandra

What do we get in return?

Operational simplicity

Homogeneous cluster, no “master” node, no SPOF

Page 14: Learning Cassandra

What do we get in return?

Rich data model

Cassandra is more than simple key-value – columns, composites, counters, secondary indexes

Page 15: Learning Cassandra

How to NoSQL version 2

Learn about each solution

• What tradeoffs are you making?
• How is it designed?
• What algorithms does it use?

http://www.alberton.info/nosql_databases_what_when_why_phpuk2011.html

Page 16: Learning Cassandra

Amazon Dynamo + Google Big Table

• Consistent hashing
• Vector clocks *
• Gossip protocol
• Hinted handoff
• Read repair

http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

• Columnar
• SSTable storage
• Append-only
• Memtable
• Compaction

http://labs.google.com/papers/bigtable-osdi06.pdf

* not in Cassandra

Page 17: Learning Cassandra

The dynamo paper

[Diagram: a Client and nodes #1 to #6]

tokens are integers from 0 to 2^127

Page 18: Learning Cassandra

The dynamo paper

[Diagram: a Client, a Coordinator and nodes #1 to #6]

consistent hashing
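
A minimal sketch of how a coordinator might use consistent hashing to map a row key onto the token ring and pick replicas. The evenly spaced tokens, the use of MD5 and the helper names (`token_for`, `build_ring`, `replicas_for`) are illustrative assumptions, not Cassandra's actual code.

```python
import bisect
import hashlib

RING_SIZE = 2 ** 127  # token space: integers from 0 to 2^127


def token_for(key: str) -> int:
    """Hash a row key to a token on the ring (MD5 is an assumption here)."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % RING_SIZE


def build_ring(node_names):
    """Give each node one evenly spaced token; returns a sorted list of (token, node)."""
    step = RING_SIZE // len(node_names)
    return sorted((i * step, name) for i, name in enumerate(node_names))


def replicas_for(ring, key: str, replication_factor: int):
    """Walk clockwise from the key's token and take the next RF nodes."""
    tokens = [token for token, _ in ring]
    # First node whose token is >= the key's token, wrapping round to the start.
    start = bisect.bisect_left(tokens, token_for(key)) % len(ring)
    return [ring[(start + i) % len(ring)][1] for i in range(replication_factor)]


ring = build_ring(["#1", "#2", "#3", "#4", "#5", "#6"])
print(replicas_for(ring, "f97be9cc-5255-4578-8813-76701c0945bd", 3))
```

Adding a node only changes ownership of the token range next to it, which is why the ring can grow without reshuffling every key.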

Page 19: Learning Cassandra

Consistency levels

How many replicas must respond to declare success?

Page 20: Learning Cassandra

Consistency levels: read operations

Level         Description
ONE           1st response
QUORUM        N/2 + 1 replicas
LOCAL_QUORUM  N/2 + 1 replicas in local data centre
EACH_QUORUM   N/2 + 1 replicas in each data centre
ALL           All replicas

http://wiki.apache.org/cassandra/API#Read

Page 21: Learning Cassandra

Consistency levels: write operations

Level         Description
ANY           One node, including hinted handoff
ONE           One node
QUORUM        N/2 + 1 replicas
LOCAL_QUORUM  N/2 + 1 replicas in local data centre
EACH_QUORUM   N/2 + 1 replicas in each data centre
ALL           All replicas

http://wiki.apache.org/cassandra/API#Write
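
The quorum arithmetic in the tables above can be sketched as follows, with N taken to be the replication factor. The function names are hypothetical. The second helper checks the usual rule of thumb that a read sees the latest write when the read and write replica counts together exceed N.

```python
def quorum(replication_factor: int) -> int:
    """N/2 + 1 using integer division."""
    return replication_factor // 2 + 1


def overlaps(read_replicas: int, write_replicas: int, replication_factor: int) -> bool:
    """True when any read set must share at least one replica with any write set."""
    return read_replicas + write_replicas > replication_factor


rf = 3
print(quorum(rf))                            # 2
print(overlaps(quorum(rf), quorum(rf), rf))  # True: QUORUM reads + QUORUM writes
print(overlaps(1, 1, rf))                    # False: ONE reads + ONE writes
```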

Page 22: Learning Cassandra

The dynamo paper

RF = 3, CL = One

[Diagram: a Client, a Coordinator and nodes #1 to #6]

Page 23: Learning Cassandra

The dynamo paper

RF = 3, CL = Quorum

[Diagram: a Client, a Coordinator and nodes #1 to #6]

Page 24: Learning Cassandra

The dynamo paper

RF = 3, CL = One

[Diagram: a Client, a Coordinator and nodes #1 to #6]

+ hint

Page 25: Learning Cassandra

The dynamo paper

RF = 3, CL = One

[Diagram: a Client, a Coordinator and nodes #1 to #6]

Read repair
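
A simplified sketch of the idea behind read repair: the coordinator compares what the replicas return, answers with the newest version, and writes that version back to any replica that was stale or missing it. Modelling each replica as a plain dict of key to (value, timestamp) is an assumption made for illustration; a real cluster does this over the network and with more care.

```python
def read_with_repair(replicas, key):
    """replicas: list of dicts mapping key -> (value, timestamp)."""
    responses = [(replica, replica.get(key)) for replica in replicas]
    present = [version for _, version in responses if version is not None]
    if not present:
        return None
    newest = max(present, key=lambda version: version[1])
    # Repair: push the newest version to every replica that disagrees.
    for replica, seen in responses:
        if seen != newest:
            replica[key] = newest
    return newest[0]


node_a = {"foo": ("zing", 1001)}
node_b = {"foo": ("bar", 1000)}   # stale
node_c = {}                       # missed the write entirely
print(read_with_repair([node_a, node_b, node_c], "foo"))
print(node_b, node_c)
```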

Page 26: Learning Cassandra

The big table paper

• Sparse "columnar" data model
• SSTable disk storage
• Append-only commit log
• Memtable (buffer and sort)
• Immutable SSTable files
• Compaction

http://labs.google.com/papers/bigtable-osdi06.pdf
http://www.slideshare.net/geminimobile/bigtable-4820829

Page 27: Learning Cassandra

The big table paper

[Diagram: a Column is a Name and a Value, + timestamp]

Page 28: Learning Cassandra

The big table paper

[Diagram: three Columns side by side, each a Name and a Value]

we can have millions of columns *

* theoretically up to 2 billion

Page 29: Learning Cassandra

The big table paper

[Diagram: a Row is a Row Key followed by Columns, each a Name and a Value]

Page 30: Learning Cassandra

The big table paper

Column Family

[Diagram: three rows, each a Row Key followed by three Columns]

we can have billions of rows

Page 31: Learning Cassandra

The big table paper

[Diagram: a Write goes to the Memtable (Memory) and the Commit Log (Disk); the Memtable is flushed on a time/size trigger to Immutable SSTables (Disk)]
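
The write path on this slide can be sketched as a toy in-memory store. The class name `TinyStore`, the flush threshold and the use of Python lists and tuples in place of files are all assumptions for illustration; the point is only the order of operations: append to the log, buffer in the memtable, flush to a sorted immutable table, merge tables during compaction.

```python
class TinyStore:
    def __init__(self, flush_threshold=3):
        self.commit_log = []   # stands in for the append-only log on disk
        self.memtable = {}     # in-memory buffer
        self.sstables = []     # each is an immutable, sorted tuple of (key, value)
        self.flush_threshold = flush_threshold

    def write(self, key, value):
        self.commit_log.append((key, value))   # durability first
        self.memtable[key] = value
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        """Sort the memtable and freeze it as a new SSTable."""
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log = []   # flushed data no longer needs the log

    def read(self, key):
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):   # newest table first
            for k, v in sstable:
                if k == key:
                    return v
        return None

    def compact(self):
        """Merge all SSTables into one, keeping the newest value per key."""
        merged = {}
        for sstable in self.sstables:   # oldest first, so newer values overwrite
            merged.update(sstable)
        self.sstables = [tuple(sorted(merged.items()))]


store = TinyStore(flush_threshold=2)
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    store.write(key, value)
print(len(store.sstables), store.read("a"))
store.compact()
print(store.sstables)
```

Before compaction a read may have to look in several SSTables; afterwards the row sits in one place.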

Page 32: Learning Cassandra

Data model basics: conflict resolution

Per-column timestamp-based conflict resolution

http://cassandra.apache.org/

{ column: foo, value: bar, timestamp: 1000}

{ column: foo, value: zing, timestamp: 1001}

Page 33: Learning Cassandra

Data model basics: conflict resolution

Per-column timestamp-based conflict resolution

http://cassandra.apache.org/

{ column: foo, value: bar, timestamp: 1000}

{ column: foo, value: zing, timestamp: 1001}

bigger timestamp
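
A minimal sketch of per-column, timestamp-based conflict resolution using the two versions shown on the slide. The `resolve` helper is hypothetical; it simply keeps whichever version carries the bigger timestamp.

```python
def resolve(*versions):
    """Return the version of a column with the highest timestamp."""
    return max(versions, key=lambda column: column["timestamp"])


older = {"column": "foo", "value": "bar", "timestamp": 1000}
newer = {"column": "foo", "value": "zing", "timestamp": 1001}
print(resolve(older, newer)["value"])  # zing
```

Because the rule only looks at timestamps, the order in which replicas receive the two writes does not matter.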

Page 34: Learning Cassandra

Data model basics: column ordering

Columns ordered at time of writing, according to Column Family schema

http://cassandra.apache.org/

{ column: zebra, value: foo, timestamp: 1000}

{ column: badger, value: foo, timestamp: 1001}

Page 35: Learning Cassandra

Data model basics: column ordering

Columns ordered at time of writing, according to Column Family schema

http://cassandra.apache.org/

{ badger: foo, zebra: foo}

with AsciiType column schema
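
A small sketch of what "ordered at time of writing" means: each column is placed at its sorted position as it arrives, so the row is always ready to be read in order. Plain Python string comparison stands in for the AsciiType comparator here, and `insert_column` is a hypothetical helper.

```python
import bisect


def insert_column(row, name, value):
    """row is a list of (name, value) tuples kept sorted by column name."""
    names = [existing for existing, _ in row]
    position = bisect.bisect_left(names, name)
    if position < len(row) and row[position][0] == name:
        row[position] = (name, value)       # overwrite an existing column
    else:
        row.insert(position, (name, value))


row = []
insert_column(row, "zebra", "foo")
insert_column(row, "badger", "foo")
print(row)  # [('badger', 'foo'), ('zebra', 'foo')]
```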

Page 36: Learning Cassandra

Key point

Each “query” can be answered from a single slice of disk

(once compaction has finished)

Page 37: Learning Cassandra

Data modeling – 1000ft introduction

• Start from your queries and work backwards

• Denormalise in the application (store data more than once)

http://www.slideshare.net/mattdennis/cassandra-data-modeling
http://blip.tv/datastax/data-modeling-workshop-5496906

Page 38: Learning Cassandra

Pattern 1: not using the value

Storing that user X is in bucket Y

Row key: f97be9cc-5255-457…
Column name: foo
Value: 1

https://github.com/davegardnerisme/we-have-your-kidneys/blob/master/www/add.php#L53-58

we don’t really care about this

Page 39: Learning Cassandra

Pattern 1: not using the value

Q: is user X in bucket foo?

f97be9cc-5255-4578-8813-76701c0945bd
  bar: 1
  foo: 1

06a6f1b0-fcf2-41d9-8949-fe2d416bde8e
  baz: 1
  zoo: 1

503778bc-246f-4041-ac5a-fd944176b26d
  aaa: 1

A: single column fetch

Page 40: Learning Cassandra

Pattern 1: not using the value

Q: which buckets is user X in?

f97be9cc-5255-4578-8813-76701c0945bd
  bar: 1
  foo: 1

06a6f1b0-fcf2-41d9-8949-fe2d416bde8e
  baz: 1
  zoo: 1

503778bc-246f-4041-ac5a-fd944176b26d
  aaa: 1

A: column slice fetch
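
The two questions for this pattern can be sketched with a nested dict standing in for the column family (row key to columns). The helper names are hypothetical; the value `1` is never inspected, only the presence of the column name.

```python
users = {
    "f97be9cc-5255-4578-8813-76701c0945bd": {"bar": 1, "foo": 1},
    "06a6f1b0-fcf2-41d9-8949-fe2d416bde8e": {"baz": 1, "zoo": 1},
    "503778bc-246f-4041-ac5a-fd944176b26d": {"aaa": 1},
}


def in_bucket(rows, user_id, bucket):
    """Single column fetch: does the column exist in this user's row?"""
    return bucket in rows.get(user_id, {})


def buckets_for(rows, user_id):
    """Column slice fetch: all column names in the row, in sorted order."""
    return sorted(rows.get(user_id, {}))


print(in_bucket(users, "f97be9cc-5255-4578-8813-76701c0945bd", "foo"))  # True
print(buckets_for(users, "06a6f1b0-fcf2-41d9-8949-fe2d416bde8e"))       # ['baz', 'zoo']
```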

Page 41: Learning Cassandra

Pattern 1: not using the value

We could also use expiring columns to automatically delete columns N seconds after insertion

UPDATE users USING TTL = 3600
SET 'foo' = 1
WHERE KEY = 'f97be9cc-5255-4578-8813-76701c0945bd'
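
To illustrate the behaviour of an expiring column, here is a toy row that stores an expiry time alongside each value and hides the column once that time has passed. `ExpiringRow` and its injectable clock are assumptions for the sketch; in Cassandra the expiry is handled by the database itself.

```python
import time


class ExpiringRow:
    def __init__(self, clock=time.time):
        self._clock = clock
        self._columns = {}   # name -> (value, expires_at or None)

    def set(self, name, value, ttl=None):
        expires_at = None if ttl is None else self._clock() + ttl
        self._columns[name] = (value, expires_at)

    def get(self, name):
        value, expires_at = self._columns.get(name, (None, None))
        if expires_at is not None and self._clock() >= expires_at:
            del self._columns[name]   # column has expired
            return None
        return value


now = [0.0]                              # fake clock so the demo is instant
row = ExpiringRow(clock=lambda: now[0])
row.set("foo", 1, ttl=3600)
print(row.get("foo"))                    # 1
now[0] = 3600.0                          # an hour later
print(row.get("foo"))                    # None
```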

Page 42: Learning Cassandra

Pattern 2: counters

Real-time analytics to count clicks/impressions of ads in hourly buckets

Row key: 1
Column name: 2011103015-click
Value: 34

https://github.com/davegardnerisme/we-have-your-kidneys/blob/master/www/adClick.php

Page 43: Learning Cassandra

Pattern 2: counters

Increment by 1 using CQL

UPDATE ads
SET '2011103015-impression' = '2011103015-impression' + 1
WHERE KEY = '1'

Page 44: Learning Cassandra

Pattern 2: counters

Q: how many clicks/impressions for ad 1 over time range?

1
  2011103015-click: 1
  2011103015-impression: 3434
  2011103016-click: 12
  2011103016-impression: 5411
  2011103017-click: 2
  2011103017-impression: 345

A: column slice fetch, between column X and Y
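
A sketch of the "between column X and Y" slice using the counters shown for ad 1. Because the column names start with the hour, sorting them by name also sorts them by time, so a time range becomes a contiguous slice. The helper names are hypothetical and a dict stands in for the row.

```python
import bisect

ad_1 = {
    "2011103015-click": 1,
    "2011103015-impression": 3434,
    "2011103016-click": 12,
    "2011103016-impression": 5411,
    "2011103017-click": 2,
    "2011103017-impression": 345,
}


def column_slice(row, start, finish):
    """Return (name, value) pairs with start <= name <= finish, in name order."""
    names = sorted(row)
    low = bisect.bisect_left(names, start)
    high = bisect.bisect_right(names, finish)
    return [(name, row[name]) for name in names[low:high]]


def totals(row, first_hour, last_hour):
    """Sum clicks and impressions for every hour in the inclusive range."""
    clicks = impressions = 0
    # "click" sorts before "impression", so these bounds cover whole hours.
    for name, count in column_slice(row, first_hour + "-click", last_hour + "-impression"):
        if name.endswith("-click"):
            clicks += count
        else:
            impressions += count
    return clicks, impressions


print(totals(ad_1, "2011103015", "2011103016"))  # (13, 8845)
```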

Page 45: Learning Cassandra

Pattern 3: time series

Store canonical reference of impressions and clicks

Row key: 20111030
Column name: <time UUID>
Value: {json}

http://rubyscale.com/2011/basic-time-series-with-cassandra/

Cassandra can order columns by time
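
A sketch of the time-series layout: one row per day, a version 1 (time-based) UUID as the column name and a JSON document as the value. Python's standard `uuid.uuid1()` plays the part of the time UUID; the dict-based storage and the helper names are assumptions. Sorting on the UUID's embedded timestamp mimics columns ordered by time.

```python
import datetime
import json
import uuid

# Version 1 UUID timestamps count 100 ns intervals from this date.
GREGORIAN_EPOCH = datetime.datetime(1582, 10, 15)


def uuid_to_datetime(time_uuid):
    return GREGORIAN_EPOCH + datetime.timedelta(microseconds=time_uuid.time // 10)


def record(rows, payload):
    """Store one event; returns the (row key, column name) it was written under."""
    column_name = uuid.uuid1()
    row_key = uuid_to_datetime(column_name).strftime("%Y%m%d")
    rows.setdefault(row_key, {})[column_name] = json.dumps(payload)
    return row_key, column_name


def in_time_order(row):
    """Decode the events of one day, oldest first."""
    return [json.loads(row[name]) for name in sorted(row, key=lambda u: u.time)]


rows = {}
for n in range(3):
    record(rows, {"type": "impression", "n": n})
for day in sorted(rows):
    print(day, in_time_order(rows[day]))
```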

Page 46: Learning Cassandra

Pattern 4: object properties as columns

Store user properties such as name, email, etc.

Row key: f97be9cc-5255-457…
Column name: name
Value: Bob Foo-Bar

http://www.wehaveyourkidneys.com/adPerformance.php?ad=1

Page 47: Learning Cassandra

Anti-pattern 1: read-before-write

Instead store as independent columns and mutate individually

(see pattern 4)
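
A sketch of why read-before-write hurts, under the assumption that the alternative being warned against is storing the whole object as one serialised value. Two clients each read the blob, change a different property and write it back, and the second write silently discards the first change. With independent columns each client writes only what it changed. The stores and helper names are hypothetical.

```python
import json


def blob_read(store, user_id):
    return json.loads(store[user_id])


def blob_write(store, user_id, user):
    store[user_id] = json.dumps(user)


def column_write(store, user_id, name, value):
    """No read needed: only the changed column is written."""
    store.setdefault(user_id, {})[name] = value


# Anti-pattern: whole object in one value, read-modify-write.
blob_store = {"u1": json.dumps({"name": "Bob Foo-Bar", "email": "old@example.com"})}
client_a = blob_read(blob_store, "u1")
client_b = blob_read(blob_store, "u1")
client_a["name"] = "Robert Foo-Bar"
client_b["email"] = "new@example.com"
blob_write(blob_store, "u1", client_a)
blob_write(blob_store, "u1", client_b)   # overwrites client A's change
print(blob_read(blob_store, "u1"))

# Pattern 4: one column per property, mutated individually.
column_store = {"u1": {"name": "Bob Foo-Bar", "email": "old@example.com"}}
column_write(column_store, "u1", "name", "Robert Foo-Bar")
column_write(column_store, "u1", "email", "new@example.com")
print(column_store["u1"])
```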

Page 48: Learning Cassandra

Anti-pattern 2: super columns

Friends don’t let friends use super columns.

http://rubyscale.com/2010/beware-the-supercolumn-its-a-trap-for-the-unwary/

Page 49: Learning Cassandra

Anti-pattern 3: OPP

The Order Preserving Partitioner unbalances your load and makes your life harder

http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/
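
A sketch of how order-preserving placement can unbalance load. With keys that share a prefix, placement by sort order sends them all to the node owning that range, while placement by hash spreads them out. The four-node layout, the range boundaries and the use of MD5 are assumptions for illustration.

```python
import bisect
import hashlib
from collections import Counter

NODE_COUNT = 4
BOUNDARIES = ["g", "n", "t"]   # hypothetical range splits between the four nodes


def ordered_node(key: str) -> int:
    """Order-preserving placement: the node is chosen by where the key sorts."""
    return bisect.bisect_right(BOUNDARIES, key)


def hashed_node(key: str) -> int:
    """Random-style placement: the node is chosen by a hash of the key."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big") % NODE_COUNT


keys = [f"user{i:04d}" for i in range(1000)]
print(Counter(ordered_node(k) for k in keys))   # every key lands on one node
print(Counter(hashed_node(k) for k in keys))    # keys spread across the nodes
```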

Page 50: Learning Cassandra

Recap: Data modeling

• Think about the queries, work backwards

• Don’t overuse single rows; try to spread the load

• Don’t use super columns

• Ask on IRC! #cassandra

Page 51: Learning Cassandra

There’s more: Brisk

Integrated Hadoop distribution (without HDFS installed). Run Hive and Pig queries directly against Cassandra

DataStax offer this functionality in their “Enterprise” product

http://www.datastax.com/products/enterprise

Page 52: Learning Cassandra

Hive: SQL-like interface to Hadoop

CREATE EXTERNAL TABLE tempUsers
  (userUuid string, segmentId string, value string)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES (
  "cassandra.columns.mapping" = ":key,:column,:value",
  "cassandra.cf.name" = "users"
);

SELECT segmentId, count(1) AS total
FROM tempUsers
GROUP BY segmentId
ORDER BY total DESC;

Page 53: Learning Cassandra

In conclusion

Cassandra is founded on sound design principles

Page 54: Learning Cassandra

In conclusion

The data model is incredibly powerful

Page 55: Learning Cassandra

In conclusion

CQL and a new breed of clients are making it easier to use

Page 56: Learning Cassandra

In conclusion

Hadoop integration means we can analyse data directly from a Cassandra cluster

Page 57: Learning Cassandra

In conclusion

There is a strong community and multiple companies offering professional support

Page 58: Learning Cassandra

Thanks

Learn more about Cassandra
meetup.com/Cassandra-London

Sample ad-targeting project on Github https://github.com/davegardnerisme/we-have-your-kidneys

Watch videos from Cassandra SF 2011
http://www.datastax.com/events/cassandrasf2011/presentations

looking for a job?