using cassandra with your web application

Using Cassandra in your Web Applications Tom Melendez, Yahoo!

Upload: supertom

Post on 14-Dec-2014


Page 1: Using Cassandra with your Web Application

Using Cassandra in your Web Applications

Tom Melendez, Yahoo!

Page 2: Using Cassandra with your Web Application

Why do we need another DB?

• We really like MySQL
• Everyone knows MySQL
  – and if they don’t, they definitely know SQL, Codd, normalization, etc.
• Lots of tools are based on SQL backends:
  – 3rd party
  – home grown

Page 3: Using Cassandra with your Web Application

Should I consider NoSQL?

• Well, maybe
• There’s a gazillion NoSQL solutions out there
• If you’re already using Memcached on top of your db, then you should look closely at NoSQL, as you’ve already identified an issue with your current infrastructure.

Page 4: Using Cassandra with your Web Application

Cassandra: Overview

• Eventually consistent
• Highly Available
• Really fast reads, Really fast writes
• Flexible schemas
• Distributed
• No “Master” - No Single Point of Failure
• BigTable plus Dynamo
• Written in Java

Page 5: Using Cassandra with your Web Application

A little context

• SQL Joins can be expensive
• Sharding can be a PITA
• Master is a point of failure (that can be mitigated, but we all know it’s painful)

• The data really might not be that important RIGHT NOW.

• Oh yeah, someone got tired of lousy response times

Page 6: Using Cassandra with your Web Application

A little history

• Released by Facebook as Open Source
• Hosted at Google Code for a bit
• Now an Apache Project
• Based on:
  – Amazon’s Dynamo
    • All nodes are Equal (no master)
    • partitioning/replication
  – Google’s Big Table
    • Column Families

Page 7: Using Cassandra with your Web Application

Sounds great, right?

• When do I throw away our SQL DB?
• When do I get my promotion?
• When do I go on vacation?

Not So Fast.

Page 8: Using Cassandra with your Web Application

What you talkin’ about, Willis?

Page 9: Using Cassandra with your Web Application

You WILL see this slide again

• You will need to rewrite code and probably re-arch the application

• You will need to run in parallel for testing
• You will need training for your Dev and Ops
• You will need to develop new tools and processes
• Cassandra isn’t the only NoSQL option
• You’ll (likely) still need/want SQL somewhere in your infrastructure

Page 10: Using Cassandra with your Web Application

CAP Theorem

• Consistency – how consistent is the data across nodes?

• Availability – how available is the system?
• Partition Tolerance – will the system function if we lose a piece of it?

CAP Theorem basically says you get to pick 2 of the above.
(Anyone else reminded of: “Good, Fast and Cheap, pick two”?)

Page 11: Using Cassandra with your Web Application

CAP and Cassandra

• The tradeoffs among C, A, and P are tunable by the client on a per-transaction basis
• For example, when adding a user record, you could insist that this transaction is CONSISTENCY.ALL if you wanted.
  – To really get the benefit of Cassandra, you need to look at what data DOES NOT need CONSISTENCY.ALL

Page 12: Using Cassandra with your Web Application

Consistency Levels: Writes

Level     Desc
ZERO      Ensure nothing. A write happens asynchronously in background.
ANY       A write must have been written to at least one node (can include hinted handoff recipients)
ONE       A write has been written to at least 1 replica's commit log and memory table before responding to the client.
QUORUM    Ensure that the write has been written to N / 2 + 1 replicas before responding to the client. (N is the Replication Factor)
DCQUORUM  Similar to QUORUM but uses RackAwarePlacementStrategy
ALL       All replicas must have received the write otherwise the operation will fail.
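The N / 2 + 1 rule in the table can be made concrete with a small sketch. The function names `quorum` and `acks_required` are made up for this illustration and are not part of any Cassandra client; the division is assumed to be integer division, and DCQUORUM is left out because it depends on data-center layout.

```python
def quorum(replication_factor):
    """Replicas needed for QUORUM: N / 2 + 1, using integer division."""
    if replication_factor < 1:
        raise ValueError("replication factor must be at least 1")
    return replication_factor // 2 + 1


def acks_required(level, replication_factor):
    """Acknowledgements a write waits for at each level in the table.

    Hypothetical helper for illustration only. ANY and ONE both wait for
    a single acknowledgement; the difference is that ANY also accepts a
    hinted handoff recipient.
    """
    required = {
        "ZERO": 0,
        "ANY": 1,
        "ONE": 1,
        "QUORUM": quorum(replication_factor),
        "ALL": replication_factor,
    }
    return required[level]


if __name__ == "__main__":
    for n in (1, 2, 3, 5):
        print("replication factor", n, "-> quorum", quorum(n))
```

With a replication factor of 3, QUORUM waits for 2 replicas, so one replica can be down without failing the write, while ALL cannot tolerate any replica being down.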

Page 13: Using Cassandra with your Web Application

Consistency Levels: Reads

Level     Desc
ONE       Returns the response from the first replica to respond; a consistency check then runs in a background thread.
QUORUM    Returns the record with the most recent timestamp once a majority of replicas (N / 2 + 1) has reported. (N is the Replication Factor)
DCQUORUM  Keeps the reads within a data center via RackAwarePlacementStrategy to avoid the latency of inter-data center communication.
ALL       Return the record with the most recent timestamp once all replicas have replied, failing the operation if any replicas are unresponsive.
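The "most recent timestamp wins" behaviour of the read levels can be sketched in a few lines. This is a toy model under stated assumptions, not how Cassandra is implemented: `Reply` and `resolve_read` are hypothetical names, and `replies` is assumed to be a list in arrival order.

```python
from collections import namedtuple

# Hypothetical record of what one replica sent back.
Reply = namedtuple("Reply", ["replica", "value", "timestamp"])


def resolve_read(replies, replication_factor, level="QUORUM"):
    """Pick the value a read would return at the given consistency level.

    Waits for the number of replies the level requires, then returns the
    reply carrying the most recent timestamp among them.
    """
    needed = {
        "ONE": 1,
        "QUORUM": replication_factor // 2 + 1,
        "ALL": replication_factor,
    }[level]
    if len(replies) < needed:
        raise RuntimeError("not enough replicas responded for " + level)
    return max(replies[:needed], key=lambda reply: reply.timestamp)


if __name__ == "__main__":
    replies = [Reply("node-a", "old", 1), Reply("node-b", "new", 2)]
    print(resolve_read(replies, 3, "ONE").value)
    print(resolve_read(replies, 3, "QUORUM").value)
```

In this example a read at ONE can return the stale value from the first replica to answer, while a read at QUORUM sees the newer write. That is the trade-off being tuned per request.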

Page 14: Using Cassandra with your Web Application

Running Cassandra

• Does it fit in your infrastructure?
• Clustering/Partitioning
• Replication/Snitching
• Monitoring
• Tuning
• Tools/Utilities
  – A couple exist, but you’ll likely need to build your own or at least augment what’s available

Page 15: Using Cassandra with your Web Application

Clustering

• The ring
  – Each node has a unique token (dependent on the Partitioner used)
  – Each node is responsible for the range of tokens from the previous node’s token (exclusive) up to its own token (inclusive)
  – The token determines on which node rows are stored
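A minimal sketch of the ring lookup described above, assuming integer tokens and made-up node names. The `Ring` class is hypothetical and only illustrates the idea that a row's token lands on the first node whose token is greater than or equal to it, wrapping around at the end.

```python
import bisect


class Ring:
    """Toy token ring: each node owns (previous token, own token]."""

    def __init__(self, node_tokens):
        # node_tokens maps a node name to its unique token.
        pairs = sorted((token, node) for node, token in node_tokens.items())
        self.tokens = [token for token, _ in pairs]
        self.nodes = [node for _, node in pairs]

    def owner(self, token):
        # First node whose token is >= the row token; past the last
        # token we wrap around to the first node on the ring.
        index = bisect.bisect_left(self.tokens, token)
        return self.nodes[index % len(self.nodes)]


if __name__ == "__main__":
    ring = Ring({"A": 0, "B": 50, "C": 100})
    for row_token in (0, 1, 50, 51, 101):
        print(row_token, "->", ring.owner(row_token))
```

Real tokens are much larger numbers (or strings, depending on the partitioner), but the lookup has the same shape.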

Page 16: Using Cassandra with your Web Application

Partitioning

• How data is stored on the cluster
  – Random
  – Order Preserving
  – You can implement your own Custom Partitioning

Page 17: Using Cassandra with your Web Application

Partitioning: Types

• Random
  – Default
  – Good distribution of data across cluster
  – Example usage: logging application
• Order Preserving
  – Good for range queries
  – OPP has seen some issues on the mailing list lately
• Custom
  – implement IPartitioner to create your own
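The difference between the two built-in partitioner styles can be sketched as two token functions. This is an illustration under assumptions: `random_token` hashes the key with MD5, which is the idea the random partitioner is built on, but the exact token derivation in Cassandra differs; `order_preserving_token` simply uses the key itself. Both function names are made up for this example.

```python
import hashlib


def random_token(key):
    """Hash the key so that neighbouring keys scatter across the ring."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest, "big")


def order_preserving_token(key):
    """Use the key itself, so token order equals key order."""
    return key


if __name__ == "__main__":
    keys = ["albert", "b", "jason", "jen1982", "zack"]
    print("order preserving:", sorted(keys, key=order_preserving_token))
    print("random:          ", sorted(keys, key=random_token))
```

With the order-preserving tokens, a range such as "all keys from b to jen1982" maps to one contiguous stretch of the ring, which is why it suits range queries. With hashed tokens the same keys are spread around, which is why distribution is good but range scans over keys are not meaningful.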

Page 18: Using Cassandra with your Web Application

Operations: Replication

• The first replica lives on whatever node claims that range; the other replicas are there should that node fail
• The rest are determined by replication strategies
• You can tell Cassandra if the nodes are in a rack via IReplicaPlacementStrategy
  – RackUnawareStrategy
  – RackAwareStrategy
  – You can create your own
• Replication factor – how many copies of the data do we want
• These options go in conf/storage-conf.xml
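As a rough sketch of the simplest placement idea (my understanding of the rack-unaware approach, stated here as an assumption rather than taken from the slides): the first replica is the node that claims the token's range, and the remaining replicas are the next nodes walking around the ring. The function `replica_nodes` and the node names are hypothetical.

```python
import bisect


def replica_nodes(ring, token, replication_factor):
    """Return the nodes holding a row, ignoring racks and data centers.

    ring is a list of (token, node) pairs sorted by token. The first
    replica is the node claiming the range; the rest are its successors
    on the ring, wrapping around at the end.
    """
    tokens = [node_token for node_token, _ in ring]
    start = bisect.bisect_left(tokens, token) % len(ring)
    # There cannot be more distinct replicas than there are nodes.
    count = min(replication_factor, len(ring))
    return [ring[(start + i) % len(ring)][1] for i in range(count)]


if __name__ == "__main__":
    ring = [(0, "A"), (50, "B"), (100, "C"), (150, "D")]
    print(replica_nodes(ring, 60, 3))
```

A rack-aware strategy would replace the "next nodes on the ring" step with a choice that takes the physical location of each node into account, which is what snitching provides.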

Page 19: Using Cassandra with your Web Application

Operations: Snitching

• Telling Cassandra the physical location of nodes
  – EndPoint – figure out based on IP address
  – PropertySnitch – individual IPs to datacenters/racks
  – DatacenterEndpointSnitch – give it subnets and datacenters

Page 20: Using Cassandra with your Web Application

Operations - Monitoring

• IMO, it is critical that you get this working immediately (i.e., as soon as you have something running)

• Basically requires being able to run JMX queries and ideally store this data over time.

• Advice: watch the mailing list. I’m betting a HOWTO will pop up soon as we all have the same problem.

Page 21: Using Cassandra with your Web Application

Operations - Tuning

• You’ve set up monitoring, right?
• As you add ColumnFamilies, tuning might change
• Things you tune:
  – Memtables (in mem structure: like a write-back cache)
  – Heap Sizing: don’t ramp up the heap without testing first
  – key cache: probably want to raise this for reads
  – row cache

Page 22: Using Cassandra with your Web Application

Utilities: NodeTool

• Really important. Helps you manage your cluster. Find under the bin/ dir in the download
  – get some disk storage stats
  – heap memory usage
  – data snapshot
  – decommission a node
  – move a node

Page 23: Using Cassandra with your Web Application

Utilities: cassandra-cli

• This is NOT the equivalent of:
  – mysql> (although it does provide a prompt)
  – the mysql executable

• You can do basic get/set operations and some other stuff

• It is really meant to check and see if things are working
  – Maybe one day it will grow into something more

Page 24: Using Cassandra with your Web Application

Utilities: cassandra-cli

Example:

cassandra> set Keyspace1.Standard1['user']['tom'] = 'cool'
Value inserted.
cassandra> count Keyspace1.Standard1['user']
1 columns
cassandra> get Keyspace1.Standard1['user']['tom']
=> (column=746f6d, value=cool, timestamp=1286875497246000)
cassandra> show api version
2.2.0
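The `column=746f6d` in the output above is the column name printed as hex bytes. A short sketch shows how to read it back. The helper names are made up, and treating the timestamp as microseconds since the Unix epoch is an assumption based on common client convention, since the value is supplied by the client.

```python
from datetime import datetime, timedelta, timezone


def decode_column_name(hex_name):
    """Turn the hex dump printed by the CLI back into text."""
    return bytes.fromhex(hex_name).decode("utf-8")


def timestamp_to_datetime(micros):
    """Interpret a timestamp as microseconds since the epoch (assumed)."""
    epoch = datetime(1970, 1, 1, tzinfo=timezone.utc)
    return epoch + timedelta(microseconds=micros)


if __name__ == "__main__":
    print(decode_column_name("746f6d"))
    print(timestamp_to_datetime(1286875497246000).isoformat())
```

Decoding `746f6d` gives back the column name `tom` that was used in the `set` command.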

Page 25: Using Cassandra with your Web Application

Other Utilities

• stress.py – helps you test the performance of your cluster.
  – run periodically against your cluster(s)
  – be prepared with these results when asking for perf help on the mailing list
• binary-memtable – a bulk loader that avoids some of the Thrift overhead. Use with caution.

Page 26: Using Cassandra with your Web Application

Data Model

• Simply put, it is similar to a multi-dimensional array

• The general strategy is denormalized data, sacrificing disk space for speed/efficiency

• Think about your queries (your DBAs will like this, but won’t like the way it is done!)

• You’ll end up getting very creative
• You need to know your queries in advance; they ultimately define your schema.

Page 27: Using Cassandra with your Web Application

Data Model

• Again, keep in mind that you’re (probably) after denormalizing.

• I know it’s painful.
• Terms you’ll see:
  – Keyspaces
  – Column Families
  – SuperColumns
  – Indexes
  – Queries

Page 28: Using Cassandra with your Web Application

Data Model

• Column Family
  – Think of it as a DB table
• Column
  – Key-Value Pair (NOT just a value, like a DB column)
  – they also have a timestamp
• SuperColumn
  – Columns inside a column
  – So, you have a key, and its value is a set of columns
  – no timestamp
• Keyspace – like a namespace, generally 1 per app
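One way to picture these terms is as nested dictionaries, which is also what the "multi-dimensional array" description suggests. The sketch below is a toy in-memory model, not a client API: `Column`, `insert` and `get` are names invented for the illustration, and the rule that a newer timestamp replaces an older one is stated as an assumption about how conflicting writes resolve.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class Column:
    """A column is a name/value pair that also carries a timestamp."""
    name: str
    value: str
    timestamp: int


def insert(keyspace, column_family, row_key, name, value, timestamp=None):
    """Store a column at keyspace[column_family][row_key][name]."""
    if timestamp is None:
        timestamp = int(time.time() * 1_000_000)
    row = keyspace.setdefault(column_family, {}).setdefault(row_key, {})
    current = row.get(name)
    # Assumption: the write with the most recent timestamp wins.
    if current is None or timestamp >= current.timestamp:
        row[name] = Column(name, value, timestamp)


def get(keyspace, column_family, row_key, name):
    return keyspace[column_family][row_key][name]


if __name__ == "__main__":
    keyspace1 = {}
    insert(keyspace1, "Standard1", "user", "tom", "cool")
    print(get(keyspace1, "Standard1", "user", "tom"))
```

A super column family adds one more dictionary level between the row key and the columns: row key, then SuperColumn name, then column name.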

Page 29: Using Cassandra with your Web Application

Data Model Indexes and Queries

• Here is where you get creative
• Regardless of the partitioner, rows are always stored sorted by key
• Column sorting:
  – CompareWith and CompareSubcolumnsWith
  – Types: ASCIIType, LongType, LexicalUUIDType, TimeUUIDType, UTF8Type
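Why the comparator type matters is easiest to see with numeric column names. The sketch below only imitates the ordering idea in plain Python: `sort_as_utf8` and `sort_as_long` are made-up stand-ins for what choosing UTF8Type or LongType would do to column order, not calls into Cassandra.

```python
def sort_as_utf8(names):
    """Order column names by their UTF-8 bytes, like a text comparator."""
    return sorted(names, key=lambda name: name.encode("utf-8"))


def sort_as_long(names):
    """Order column names by numeric value, like a long comparator."""
    return sorted(names, key=int)


if __name__ == "__main__":
    column_names = ["10", "9", "100"]
    print(sort_as_utf8(column_names))
    print(sort_as_long(column_names))
```

Compared as text, "100" sorts before "9"; compared as numbers, 9 comes first. Since columns are returned in comparator order, the choice decides what a slice of "the first N columns" actually contains.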

Page 30: Using Cassandra with your Web Application

Data Model: Indexes and Queries

• Your bag of tricks includes:
  – creating column families for each query
  – getting the row key to be the WHERE of your SQL query
  – using column and SuperColumn names as “values”
    • columns are stored sorted within the row

Page 31: Using Cassandra with your Web Application

Data Model: Example

• Example data set:

"b": {"name":"Ben", "street":"1234 Oak St.", "city":"Seattle", "state":"WA"}
"jason": {"name":"Jason", "street":"456 First Ave.", "city":"Bellingham", "state":"WA"}
"zack": {"name":"Zack", "street":"4321 Pine St.", "city":"Seattle", "state":"WA"}
"jen1982": {"name":"Jennifer", "street":"1120 Foo Lane", "city":"San Francisco", "state":"CA"}
"albert": {"name":"Albert", "street":"2364 South St.", "city":"Boston", "state":"MA"}

(Taken from Benjamin Black’s presentation on indexing – twitter: @b6n)

Page 32: Using Cassandra with your Web Application

Data Model: Example

• Given that data set, we want to say:
  – SELECT name FROM Users WHERE state="WA"
• We create a ColumnFamily:

<ColumnFamily Name="LocationUserIndexSCF"
              CompareWith="UTF8Type"
              CompareSubcolumnsWith="UTF8Type"
              ColumnType="Super" />

(Taken from Benjamin Black’s presentation on indexing – twitter: @b6n)

Page 33: Using Cassandra with your Web Application

Data Model: Example

• Which looks like this:

[state]: {
  [city1]: {[name1]:[user1], [name2]:[user2], ... },
  [city2]: {[name3]:[user3], [name4]:[user4], ... },
  ...
  [cityX]: {[name5]:[user5], [name6]:[user6], ... }
}

• State is the row key, so we can select by it and we’ll get the city grouping and name sorting basically for free.

(Taken from Benjamin Black’s presentation on indexing – twitter @b6n)
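To make the index layout concrete, here is a sketch that builds the state, city, name structure from the example data set using plain dictionaries. It is an in-memory imitation only: `build_location_index` and `names_in_state` are hypothetical helpers, and `sorted()` stands in for the ordering that the UTF8Type comparators would give.

```python
USERS = {
    "b": {"name": "Ben", "city": "Seattle", "state": "WA"},
    "jason": {"name": "Jason", "city": "Bellingham", "state": "WA"},
    "zack": {"name": "Zack", "city": "Seattle", "state": "WA"},
    "jen1982": {"name": "Jennifer", "city": "San Francisco", "state": "CA"},
    "albert": {"name": "Albert", "city": "Boston", "state": "MA"},
}


def build_location_index(users):
    """Row key = state, SuperColumn = city, column name = user name,
    column value = the user's key in the main Users data."""
    index = {}
    for user_key, user in users.items():
        cities = index.setdefault(user["state"], {})
        cities.setdefault(user["city"], {})[user["name"]] = user_key
    return index


def names_in_state(index, state):
    """Equivalent of: SELECT name FROM Users WHERE state = ?"""
    names = []
    for city in sorted(index.get(state, {})):
        names.extend(sorted(index[state][city]))
    return names


if __name__ == "__main__":
    index = build_location_index(USERS)
    print(names_in_state(index, "WA"))
```

The query becomes a single row lookup by state, and the result comes back grouped by city and sorted by name. The cost is that the application has to write to this index every time a user is added or moves.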

Page 34: Using Cassandra with your Web Application

Talking to Cassandra

• Generally two ways to do this:
  – Native clients (ideal)
  – Thrift
  – Avro support is coming
• All of the PHP clients are still very Alpha
• All the PHP clients that I’ve seen use Thrift
• If you can, please use them and file bugs.
• Or even better than that – FIX IT YOURSELF!
• If you need something more stable, use Thrift

Page 35: Using Cassandra with your Web Application

PHP Clients

• Pandra (LGPL)
• PHP Cassa – pycassa port
• Simple Cassie (New BSD License)
• Prophet (PHP License)
• Clients in other languages are further along

Thanks to Chris Barber (@cb1inc) for this list

Page 36: Using Cassandra with your Web Application

Raw Cassandra API

• These are wrapped differently per client but are generally exposed by Thrift. These are just the major data manipulation methods; there are others to gather information, etc.

• Full list is here: http://wiki.apache.org/cassandra/API

Page 37: Using Cassandra with your Web Application

Raw Cassandra API

• get
• get_count
• get_key_range
• get_range_slices
• get_slice
• multiget_slice
• insert
• batch_mutate
• remove
• truncate

Page 38: Using Cassandra with your Web Application

What is Thrift?

• Thrift is a remote procedure call framework developed at Facebook for “scalable cross-language services development” – Wikipedia
• In short, you define a .thrift file (IDL file), with data structures, services, etc. and run the “thrift compiler” and get code, which you then use
  – PHP, Java, Perl, Python, C#, Erlang, Ruby (and probably others) are supported
  – thrift -php myproject.thrift is what you run
  – Generated files are in a dir called: gen-php
  – Then go in and add your logic

Page 39: Using Cassandra with your Web Application

Example IDL file

• Heavily snipped from: http://wiki.apache.org/thrift/Tutorial

# Thrift Tutorial (heavily snipped)
# Mark Slee ([email protected])
# C and C++ comments also supported
include "shared.thrift"

namespace php tutorial

service Calculator extends shared.SharedService {
  void ping(),
  i32 add(1:i32 num1, 2:i32 num2),
  i32 calculate(1:i32 logid, 2:Work w) throws (1:InvalidOperation ouch),
  oneway void zip(),
}

Page 40: Using Cassandra with your Web Application

Installing Thrift and the PHP ext

• Download and install Thrift
  – http://incubator.apache.org/thrift/download/
• To use PHP, you install the PHP extension “thrift_protocol”
  – You’ll find this in the Thrift download above
• Steps
  – cd PATH-TO-THRIFT/lib/php/src/ext/thrift_protocol
  – phpize && ./configure --enable-thrift_protocol && make
  – sudo cp modules/thrift_protocol.so /php/ext/dir
  – add extension=thrift_protocol.so to the appropriate php.ini file
• You really need APC, too (http://www.php.net/apc)

Page 42: Using Cassandra with your Web Application

So, who’s using this thing?

• Big and small companies alike
• Not sure if their applications of Cassandra are mission-critical
• Yahoo! is NOT a user, but we have our own implementation, and that implementation IS mission-critical. Do a search for “PNUTS”

Page 43: Using Cassandra with your Web Application

Facebook – Inbox search

Page 44: Using Cassandra with your Web Application

Heavy users, but not for tweets. Yet.

Page 45: Using Cassandra with your Web Application

Probably the biggest consumer-facing users of Cassandra

Page 46: Using Cassandra with your Web Application

Digg - continued

• These guys have provided a lot
  – Patches
  – Documentation/Blogs/Advocacy
  – LazyBoy Python client: http://github.com/digg/lazyboy#readme

Page 47: Using Cassandra with your Web Application

• Not totally sure, probably logging the massive amounts of data they generate from routers, switches and other hardware
  – http://www.rackspacecloud.com/blog/2010/06/07/speaking-session-on-cassandra-at-velocity-2010/

Page 48: Using Cassandra with your Web Application

Others using Cassandra

• Comcast, Cisco, CBS Interactive
  – http://www.dbthink.com/?p=183

Page 49: Using Cassandra with your Web Application

Competitors, sort of

• CouchDB – document db, accessible via javascript and REST

• HBase – no SPOF, Column Families, runs on top of Hadoop

• Memcached – used with MySQL, FB are big users

• MongoDB – cool online shell; k/v store, document db

• Redis – see Cassandra vs. Redis presentation by @tlossen from NoSQL Frankfurt 9/28/2010

• Voldemort – distributed db, built by LinkedIn

Page 50: Using Cassandra with your Web Application

Cassandra and Hadoop and Pig/Hive

• Yes, it is possible; I haven’t done it myself
• 0.6x Cassandra – Hadoop M/R jobs can read from Cassandra
• 0.7x Cassandra – Hadoop M/R jobs can write to it (again, according to the docs)
• Pig: own implementation of LoadFunc; Hive work has been started
• See:
  – http://wiki.apache.org/cassandra/HadoopSupport
  – github.com/stuhood/cassandra-summit-demo
  – slideshare.net/jeromatron cassandrahadoop-4399672
  – Hive: https://issues.apache.org/jira/browse/CASSANDRA-913

Page 51: Using Cassandra with your Web Application

Developing Cassandra itself

• Using Eclipse
• http://wiki.apache.org/cassandra/RunningCassandraInEclipse

Page 52: Using Cassandra with your Web Application

My personal recommendations

• Not that you asked.
• Understand that this is bleeding-edge
• You’re giving up a lot of SQL comforts
• Evaluate if you really need this (like anything else)
• If so, go with the latest and greatest and create a procedure to keep you running the latest and greatest (that would be 0.7x)
• Contribute back – it is good for your company and for you.
• Consider commercial support: http://www.riptano.com (I’m not affiliated in any way)

Page 53: Using Cassandra with your Web Application

Things to think about during your evaluation:

• Are we just in another cycle?
  – Fat client, thin client, big bandwidth, little bandwidth, big transactions, micro transactions
• Have we been here before?
  – Remember dBase, FoxPro, Sleepycat/BerkeleyDB?
• Is it just a technology fad?
  – How many people developed in WML/HDML only to have phones support full HTML/JS?
  – Do we all need native iPhone apps?

Page 54: Using Cassandra with your Web Application

I told you that you’d see this again…

• You will need to rewrite code and probably re-arch the application

• You will need to run in parallel for testing
• You will need training for your Dev and Ops
• You will need to develop new tools and processes
• Cassandra isn’t the only NoSQL option
• You’ll (likely) still need/want SQL

Page 55: Using Cassandra with your Web Application

Thanks!

• http://wiki.apache.org/cassandra/GettingStarted

• http://www.riptano.com/blog/slides-and-videos-cassandra-summit-2010