cassandra - o'reilly mediaassets.en.oreilly.com/1/event/27/cassandra_ open source...

Cassandra Jonathan Ellis

Page 1: Cassandra - O'Reilly Mediaassets.en.oreilly.com/1/event/27/Cassandra_ Open Source Bigtable... · Two famous papers Bigtable: A distributed storage system for structured data, 2006

Cassandra

Jonathan Ellis

Motivation

● Scaling reads to a relational database is hard

● Scaling writes to a relational database is virtually impossible
● … and when you do, it usually isn't relational anymore

The new face of data

● Scale out, not up
● Online load balancing, cluster growth
● Flexible schema
● Key-oriented queries
● CAP-aware

CAP theorem

● Pick two of Consistency, Availability, Partition tolerance

Two famous papers

● Bigtable: A distributed storage system for structured data, 2006

● Dynamo: Amazon's highly available key-value store, 2007

Two approaches

● Bigtable: “How can we build a distributed db on top of GFS?”

● Dynamo: “How can we build a distributed hash table appropriate for the data center?”

10,000 ft summary

● Dynamo partitioning and replication
● Log-structured ColumnFamily data model similar to Bigtable's

Cassandra highlights

● High availability
● Incremental scalability
● Eventually consistent
● Tunable tradeoffs between consistency and latency
● Minimal administration
● No single point of failure (SPF)

Dynamo architecture & Lookup

Architecture details

● O(1) node lookup
● Explicit replication
● Eventually consistent
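
A minimal sketch of what "O(1) node lookup" and "explicit replication" could look like, assuming a Dynamo-style token ring in which every node knows every other node's token, so a request needs no routing hops (the local search over tokens is still a binary search). The `Ring` class, the `token` helper, the use of MD5, and the rule "owner plus the next N-1 nodes clockwise" are assumptions made for illustration, not taken from the slides.

```python
import bisect
import hashlib


def token(key: str) -> int:
    """Hash a key onto the ring (MD5 is an assumption made for this sketch)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy token ring: the full ring is known locally, so lookup needs no hops."""

    def __init__(self, nodes, replicas=3):
        self.replicas = min(replicas, len(nodes))
        # One token per node, sorted so the ring can be binary-searched.
        self.entries = sorted((token(n), n) for n in nodes)
        self.tokens = [t for t, _ in self.entries]

    def replicas_for(self, key):
        """Return the node owning the key plus the next N-1 nodes clockwise."""
        start = bisect.bisect_left(self.tokens, token(key)) % len(self.entries)
        return [self.entries[(start + i) % len(self.entries)][1]
                for i in range(self.replicas)]


ring = Ring(["node-a", "node-b", "node-c", "node-d"], replicas=3)
print(ring.replicas_for("user42"))  # three distinct nodes, stable across calls
```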

Architecture layers

Messaging service

Gossip

Failure detection

Cluster state

Partitioner

Replication

Commit log

Memtable

SSTable

Indexes

Compaction

Tombstones

Hinted handoff

Read repair

Bootstrap

Monitoring

Admin tools

Writes

● Any node
● Partitioner
● Commitlog, memtable
● SSTable
● Compaction
● Wait for W responses
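
The local part of the write path above can be sketched roughly as follows. This is a toy single-node model, not Cassandra's code: the `Node` class, the `memtable_limit` threshold, and the in-memory lists standing in for on-disk files are all assumptions made for illustration.

```python
class Node:
    """Toy write path: commit log, then memtable, then flush to an SSTable."""

    def __init__(self, memtable_limit=2):
        self.commit_log = []   # stand-in for the append-only log on disk
        self.memtable = {}     # in memory: (key, column) -> (value, timestamp)
        self.sstables = []     # each flush produces one immutable, sorted table
        self.memtable_limit = memtable_limit

    def write(self, key, column, value, timestamp):
        # 1. Append to the commit log first, for durability.
        self.commit_log.append((key, column, value, timestamp))
        # 2. Update the memtable; nothing is read from disk.
        current = self.memtable.get((key, column))
        if current is None or timestamp >= current[1]:
            self.memtable[(key, column)] = (value, timestamp)
        # 3. Flush once the memtable is full.
        if len(self.memtable) >= self.memtable_limit:
            self.flush()

    def flush(self):
        # Sort once and freeze; SSTables are never modified in place.
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}
        self.commit_log.clear()  # flushed data no longer needs the log
```

Because a write only appends to the log and touches memory, it needs no disk seek; the cost of merging is deferred to compaction.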

Memtable / SSTable

Commit log

Disk

SSTable format

● Key / data

SSTable Indexes

● Bloom filter
● Key
● Column

(Similar to Hadoop MapFile / TFile)
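
For readers unfamiliar with the first index type, here is a small Bloom filter sketch. It shows why the structure is useful in front of an SSTable: a negative answer is definite, so the file can be skipped without a disk read, while a positive answer only means "maybe". The sizes, the SHA-256 based hashing, and the class itself are assumptions made for illustration and say nothing about Cassandra's actual implementation.

```python
import hashlib


class BloomFilter:
    """Probabilistic set: no false negatives, occasional false positives."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(size_bits)  # one byte per bit, for clarity

    def _positions(self, key: str):
        # Derive num_hashes positions by salting the key with an index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{key}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.size

    def add(self, key: str) -> None:
        for pos in self._positions(key):
            self.bits[pos] = 1

    def might_contain(self, key: str) -> bool:
        # False means the key was definitely never added.
        return all(self.bits[pos] for pos in self._positions(key))
```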

Compaction

● Merge keys
● Combine columns
● Discard tombstones

Remove

● Deletion marker (tombstone) necessary to suppress data in older SSTables, until compaction

● Read repair complicates things a little
● Eventually consistent complicates things more
● Solution: configurable delay before tombstone GC, after which tombstones are not repaired
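
The compaction and remove slides can be combined into one rough sketch: merge several SSTables, keep the newest version of each column, and drop a tombstone only once it is older than the configured delay. The dict-based SSTable model, the `TOMBSTONE` sentinel, and the `gc_grace` parameter name are assumptions made for illustration.

```python
TOMBSTONE = object()  # sentinel marking a deleted column


def compact(sstables, now, gc_grace):
    """Merge SSTables into one; the newest timestamp wins per (key, column).

    Each SSTable is modeled as a dict {(key, column): (value, timestamp)}.
    Tombstones older than gc_grace are discarded; younger ones are kept so
    they can still suppress stale data held elsewhere.
    """
    merged = {}
    for table in sstables:
        for cell, (value, ts) in table.items():
            if cell not in merged or ts > merged[cell][1]:
                merged[cell] = (value, ts)
    return {
        cell: (value, ts)
        for cell, (value, ts) in sorted(merged.items(), key=lambda kv: kv[0])
        if not (value is TOMBSTONE and now - ts > gc_grace)
    }
```

Keeping young tombstones is what prevents a replica that missed the delete from "resurrecting" the old value.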

Cassandra write properties

● No reads
● No seeks
● Fast
● Atomic within ColumnFamily
● Always writable

Read path

● Any node
● Partitioner
● Wait for R responses
● Wait for N – R responses in the background and perform read repair
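
A hedged sketch of the read path with read repair. Replicas are modeled as plain dicts and the "background" step runs inline, which is a simplification; the function name and data layout are assumptions made for illustration.

```python
def read_with_repair(replicas, key, r):
    """Answer from the first R replicas, then repair stale replicas.

    Each replica is modeled as a dict {key: (value, timestamp)}.
    """
    responses = [(rep, rep.get(key)) for rep in replicas]
    # Answer from the first R responses; the newest timestamp wins.
    first_r = [cell for _, cell in responses[:r] if cell is not None]
    result = max(first_r, key=lambda cell: cell[1], default=None)
    # "Background" step, done inline here: find the newest version across
    # all N replicas and push it to any replica that is missing it.
    seen = [cell for _, cell in responses if cell is not None]
    newest = max(seen, key=lambda cell: cell[1], default=None)
    if newest is not None:
        for rep, cell in responses:
            if cell is None or cell[1] < newest[1]:
                rep[key] = newest
    return result
```

With R=1 the caller may see a stale value once, but the repair step means a later read of the same key returns the newest version.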

Cassandra read properties

● Read multiple SSTables
● Slower than writes (but still fast)
● Seeks can be mitigated with more RAM
● Scales to billions of rows

Consistency in a BASE world

● If W + R > N, you will have consistency
● W=1, R=N
● W=N, R=1
● W=Q, R=Q where Q = N / 2 + 1
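
The arithmetic behind the rule is short enough to check directly: if W + R > N, any set of R replicas must share at least one member with any set of W replicas, so a read always touches a replica that saw the latest write. The helper names below are made up for this sketch, and Q uses integer division.

```python
def quorum(n: int) -> int:
    """Q = N / 2 + 1, using integer division."""
    return n // 2 + 1


def overlaps(w: int, r: int, n: int) -> bool:
    """True when every read set must intersect every write set."""
    return w + r > n


n = 3
q = quorum(n)
for w, r in [(1, n), (n, 1), (q, q), (1, 1)]:
    print(f"N={n} W={w} R={r} consistent={overlaps(w, r, n)}")
```

The last pair, W=1 and R=1, is the fast but only eventually consistent setting.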

vs MySQL with 50GB of data

● MySQL
  ● ~300ms write
  ● ~350ms read
● Cassandra
  ● ~0.12ms write
  ● ~15ms read

● Achtung!

Data model

● Rows, ColumnFamilies, Columns

ColumnFamilies

keyA column1 column2 column3

keyC column1 column7 column11

Column

Byte[] Name

Byte[] Value

I64 timestamp
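
One way to picture the slide above in code, as a sketch only: a column is a (name, value, timestamp) triple, and a ColumnFamily maps each row key to its own set of columns, so rows need not share column names. The `reconcile` helper reflects the usual "highest timestamp wins" rule; the helper and the dict layout are assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    name: bytes
    value: bytes
    timestamp: int  # supplied by the client; i64 in the Thrift definition


def reconcile(a: Column, b: Column) -> Column:
    """Pick the version with the higher timestamp (ties keep `a`)."""
    return b if b.timestamp > a.timestamp else a


# A ColumnFamily modeled as row key -> {column name -> Column}.
column_family = {
    b"keyA": {b"column1": Column(b"column1", b"x", 1)},
    b"keyC": {b"column7": Column(b"column7", b"y", 1)},
}
```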

Super ColumnFamilies

keyF Super1 Super2

keyJ Super1 Super5

column column column column column column

column column column column column column

Types of queries

● Single column
● Slice
  ● Set of names / range of names
  ● Simple slice -> columns
  ● Super slice -> supercolumns
● Key range
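
The two slice flavors can be sketched over a single row, modeled here as a dict of column name to value. Both function names and the `count` parameter are hypothetical; the point is only that a range slice relies on columns being kept in sorted order.

```python
import bisect


def slice_by_names(row, names):
    """Return the requested columns that exist in the row."""
    return {n: row[n] for n in names if n in row}


def slice_by_range(row, start, finish, count=100):
    """Return up to `count` columns with start <= name <= finish, in name order."""
    names = sorted(row)
    lo = bisect.bisect_left(names, start)
    hi = bisect.bisect_right(names, finish)
    return [(n, row[n]) for n in names[lo:hi][:count]]


row = {"a": 1, "c": 3, "b": 2, "d": 4}
print(slice_by_names(row, ["a", "z"]))   # only "a" exists
print(slice_by_range(row, "b", "c"))     # columns b and c, in order
```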

Range queries

● Add “master” server
● Implement on top of K/V
● Order-preserving partitioning

Modification

● Insert / update
● Remove
● Single column or batch
● Specify W, number of nodes to wait for

Thrift

struct Column {
   1: binary name,
   2: binary value,
   3: i64 timestamp,
}

struct SuperColumn {
   1: binary name,
   2: list<Column> columns,
}

Column get_column(table, key, column_path, block_for=1)

list<string> get_key_range(table, column_family, start_with="", stop_at="", max_results=100)

void insert(table, key, column_path, value, timestamp, block_for=0)

void remove(tablename, key, column_path_or_parent, timestamp)

Honestly, Thrift kinda sucks

Example: a multiuser blog

Two queries

- the most recent posts belonging to a given blog, in reverse chronological order

- a single post and its comments, in chronological order

First try

JBE blog

Cassandra is teh awesome BASE FTW

Evan blog

I like kittens And Ruby

post comment comment post comment comment

post comment comment post comment comment

<ColumnFamily
  Type="Super"
  CompareWith="TimeString"
  CompareSubcolumnsWith="UUID"
  Name="Blog"/>

Second try

<ColumnFamily
  CompareWith="UUIDType"
  Name="Blog"/>

JBE blog Cassandra is teh awesome

BASE FTW

Evan blog I like kittens And Ruby

Cassandra is teh awesome

comment comment

BASE FTW comment comment

I like kittens

comment comment

And Ruby comment comment

<ColumnFamily
  CompareWith="UUIDType"
  Name="Comment"/>
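
The second design can be mimicked with two dicts to show how each of the two queries becomes a slice over a single row. Integer column names stand in for time-ordered UUIDs so the example is deterministic, and the comment texts are invented placeholders; both are assumptions made for illustration.

```python
# ColumnFamily "Blog": row key = blog, column name = post time, value = post.
blog = {
    "JBE blog": {1: "Cassandra is teh awesome", 2: "BASE FTW"},
    "Evan blog": {1: "I like kittens", 2: "And Ruby"},
}
# ColumnFamily "Comment": row key = post, column name = comment time.
comment = {
    "BASE FTW": {1: "first comment", 2: "second comment"},
}


def recent_posts(blog_name, count=10):
    """Query 1: newest posts first (a reversed slice over one Blog row)."""
    row = blog.get(blog_name, {})
    return [row[name] for name in sorted(row, reverse=True)[:count]]


def post_with_comments(post_title):
    """Query 2: comments in chronological order (a forward slice over one Comment row)."""
    row = comment.get(post_title, {})
    return post_title, [row[name] for name in sorted(row)]


print(recent_posts("JBE blog"))
print(post_with_comments("BASE FTW"))
```

Each query reads exactly one row from one ColumnFamily, which is the property the first design lacked.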

Roadmap

Cassandra 0.3

● Remove support
● OPP / Range queries
● Test suite
● Workarounds for JDK bugs
● Rudimentary multi-datacenter support

Cassandra 0.4

● Branched May 18
● Data file format change to support billions of rows per node instead of millions
● API changes (no more colon delimiters)
● Multi-table (keyspace) support
● LRU key cache
● fsync support
● Bootstrap
● Web interface

Cassandra 0.5

● Bootstrap
● Load balancing
  ● Closely related to “bootstrap done right”
● Merkle tree repair
● Millions of columns per row
  ● This will require another data format change
● Multiget
● Callout support

Users

Production: Facebook, RocketFuel

Production RSN: Digg, Rackspace

No date yet: IBM Research, Twitter

Evaluating: 50+ in #cassandra on freenode

More

● Eventual consistency: http://www.allthingsdistributed.com/2008/12/eventually_consistent.html

● Introduction to distributed databases by Todd Lipcon at NoSQL 09: http://www.vimeo.com/5145059

● Other articles/videos about Cassandra: http://wiki.apache.org/cassandra/ArticlesAndPresentations

● #cassandra on irc.freenode.net

Cassandra