Learning Cassandra

Post on 15-Jan-2015

Category:

Technology

DESCRIPTION

Context for choosing NoSQL, Cassandra basics, and some basic data modelling patterns and anti-patterns.

TRANSCRIPT

<ul><li> 1. Learning Cassandra. Dave Gardner, @davegardnerisme</li></ul> <p>
2. What I'm going to cover: How to NoSQL. Cassandra basics (dynamo and big table). How to use the data model in real life.
3. How to NoSQL: 1. Find data store that doesn't use SQL. 2. Anything. 3. Cram all the things into it. 4. Triumphantly blog this success. 5. Complain a month later when it bursts into flames. http://www.slideshare.net/rbranson/how-do-i-cassandra/4
4. Choosing NoSQL: NoSQL DBs trade off traditional features to better support new and emerging use cases. http://www.slideshare.net/argv0/riak-use-cases-dissecting-the-solutions-to-hard-problems
5. Choosing Cassandra: Tradeoffs. We trade more widely used, tested and documented software (MySQL first OS release 1998) for a relatively immature product (Cassandra first open-sourced in 2008).
6. Choosing Cassandra: Tradeoffs. We trade ad-hoc querying (SQL join, group by, having, order) for a rich data model with limited ad-hoc querying ability (Cassandra makes you denormalise).
7. Choosing NoSQL: "they say 'I can't decide between this project and this project even though they look nothing like each other.' And the fact that you can't decide indicates that you don't actually have a problem that requires them." Benjamin Black, NoSQL Tapes (at 30:15). http://nosqltapes.com/video/benjamin-black-on-nosql-cloud-computing-and-fast_ip
8. What do we get in return? Proven horizontal scalability: Cassandra scales reads and writes linearly as new nodes are added.
9. Netflix benchmark: linear scaling. http://techblog.netflix.com/2011/11/benchmarking-cassandra-scalability-on.html
10. What do we get in return? High availability: Cassandra is fault-resistant with tunable consistency levels.
11. What do we get in return? Low latency, solid performance: Cassandra has very good write performance.
12. Performance benchmark*. http://blog.cubrid.org/dev-platform/nosql-benchmarking/ (* add pinch of salt)
13. What do we get in return? Operational simplicity: homogeneous cluster, no master node, no SPOF.
14. What do we get in return? Rich data model: Cassandra is more than simple key-value; it has columns, composites, counters and secondary indexes.
15. How to NoSQL, version 2: Learn about each solution. What tradeoffs are you making? How is it designed? What algorithms does it use? http://www.alberton.info/nosql_databases_what_when_why_phpuk2011.html
16. Amazon Dynamo + Google Big Table. From Dynamo: consistent hashing, vector clocks*, gossip protocol, hinted handoff, read repair (http://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf). From Big Table: columnar, SSTable storage, append-only, memtable, compaction (http://labs.google.com/papers/bigtable-osdi06.pdf). (* not in Cassandra)
17. The dynamo paper. [Diagram: a client and a ring of six nodes, #1 to #6.] Tokens are integers from 0 to 2^127.
18. The dynamo paper. [Diagram: the same ring; the client talks to a coordinator node, which locates replicas by consistent hashing.]
19. Consistency levels: How many replicas must respond to declare success?
20. Consistency levels: read operations (Level: Description). ONE: 1st response. QUORUM: N/2 + 1 replicas. LOCAL_QUORUM: N/2 + 1 replicas in local data centre. EACH_QUORUM: N/2 + 1 replicas in each data centre. ALL: all replicas. http://wiki.apache.org/cassandra/API#Read
21. Consistency levels: write operations (Level: Description). ANY: one node, including hinted handoff. ONE: one node. QUORUM: N/2 + 1 replicas. LOCAL_QUORUM: N/2 + 1 replicas in local data centre. EACH_QUORUM: N/2 + 1 replicas in each data centre. ALL: all replicas. http://wiki.apache.org/cassandra/API#Write
22. The dynamo paper. [Diagram: ring of six nodes, client and coordinator.] RF = 3, CL = One.
23. The dynamo paper. [Diagram: ring of six nodes, client and coordinator.] RF = 3, CL = Quorum.
24. The dynamo paper. [Diagram: ring of six nodes, client and coordinator.] RF = 3, CL = One, plus a hint.
25. The dynamo paper. [Diagram: ring of six nodes, client and coordinator.] RF = 3, CL = One, with read repair.
26. The big table paper: Sparse "columnar" data model. SSTable disk storage. Append-only commit log. Memtable (buffer and sort). Immutable SSTable files. Compaction. http://labs.google.com/papers/bigtable-osdi06.pdf http://www.slideshare.net/geminimobile/bigtable-4820829
27. The big table paper. [Diagram: a column is a name and a value, plus a timestamp.]
28. The big table paper. [Diagram: three columns side by side.] We can have millions of columns* (* theoretically up to 2 billion).
29. The big table paper. [Diagram: a row is a row key followed by its columns.]
30. The big table paper. [Diagram: a column family holds many rows, each a row key followed by its columns.] We can have billions of rows.
31. The big table paper. [Diagram: a write goes to the memtable in memory and to the commit log on disk; the memtable is flushed on a time/size trigger to immutable SSTables on disk.]
32. Data model basics: conflict resolution. Per-column timestamp-based conflict resolution. { column: foo, value: bar, timestamp: 1000 } versus { column: foo, value: zing, timestamp: 1001 }. http://cassandra.apache.org/
33. Data model basics: conflict resolution. Per-column timestamp-based conflict resolution. { column: foo, value: bar, timestamp: 1000 } versus { column: foo, value: zing, timestamp: 1001 }: the second has the bigger timestamp and wins. http://cassandra.apache.org/
34. Data model basics: column ordering. Columns ordered at time of writing, according to Column Family schema. { column: zebra, value: foo, timestamp: 1000 } and { column: badger, value: foo, timestamp: 1001 }. http://cassandra.apache.org/
35. Data model basics: column ordering. Columns ordered at time of writing, according to Column Family schema. { badger: foo, zebra: foo } with AsciiType column schema. http://cassandra.apache.org/
36. Key point: Each query can be answered from a single slice of disk (once compaction has finished).
37. Data modeling, 1000ft introduction: Start from your queries and work backwards. Denormalise in the application (store data more than once). http://www.slideshare.net/mattdennis/cassandra-data-modeling http://blip.tv/datastax/data-modeling-workshop-5496906
38. Pattern 1: not using the value. Storing that user X is in bucket Y. Row key: f97be9cc-5255-457. Column name: foo. Value: 1 (we don't really care about this). https://github.com/davegardnerisme/we-have-your-kidneys/blob/master/www/add.php#L53-58
39. Pattern 1: not using the value. Q: is user X in bucket foo? A: single column fetch. Rows: f97be9cc-5255-4578-8813-76701c0945bd { bar: 1, foo: 1 }; 06a6f1b0-fcf2-41d9-8949-fe2d416bde8e { baz: 1, zoo: 1 }; 503778bc-246f-4041-ac5a-fd944176b26d { aaa: 1 }.
40. Pattern 1: not using the value. Q: which buckets is user X in? A: column slice fetch. Rows: f97be9cc-5255-4578-8813-76701c0945bd { bar: 1, foo: 1 }; 06a6f1b0-fcf2-41d9-8949-fe2d416bde8e { baz: 1, zoo: 1 }; 503778bc-246f-4041-ac5a-fd944176b26d { aaa: 1 }.
41. Pattern 1: not using the value. We could also use expiring columns to automatically delete columns N seconds after insertion: UPDATE users USING TTL 3600 SET foo = 1 WHERE KEY = f97be9cc-5255-4578-8813-76701c0945bd
42. Pattern 2: counters. Real-time analytics to count clicks/impressions of ads in hourly buckets. Row key: 1. Column name: 2011103015-click. Value: 34. https://github.com/davegardnerisme/we-have-your-kidneys/blob/master/www/adClick.php
43. Pattern 2: counters. Increment by 1 using CQL: UPDATE ads SET '2011103015-impression' = '2011103015-impression' + 1 WHERE KEY = '1'
44. Pattern 2: counters. Q: how many clicks/impressions for ad 1 over time range? A: column slice fetch, between column X and Y. Row 1: { 2011103015-click: 1, 2011103015-impression: 3434, 2011103016-click: 12, 2011103016-impression: 5411, 2011103017-click: 2, 2011103017-impression: 345 }.
45. Pattern 3: time series. Store canonical reference of impressions and clicks. Row key: 20111030. Column name: Value: {json}. Cassandra can order columns by time. http://rubyscale.com/2011/basic-time-series-with-cassandra/
46. Pattern 4: object properties as columns. Store user properties such as name, email, etc. Row key: f97be9cc-5255-457. Column name: name. Value: Bob Foo-Bar. http://www.wehaveyourkidneys.com/adPerformance.php?ad=1
47. Anti-pattern 1: read-before-write. Instead store as independent columns and mutate individually (see pattern 4).
48. Anti-pattern 2: super columns. Friends don't let friends use super columns. http://rubyscale.com/2010/beware-the-supercolumn-its-a-trap-for-the-unwary/
49. Anti-pattern 3: OPP. The Order Preserving Partitioner unbalances your load and makes your life harder. http://ria101.wordpress.com/2010/02/22/cassandra-randompartitioner-vs-orderpreservingpartitioner/
50. Recap: Data modeling. Think about the queries, work backwards. Don't overuse single rows; try to spread the load. Don't use super columns. Ask on IRC! #cassandra
51. There's more: Brisk. Integrated Hadoop distribution (without HDFS installed). Run Hive and Pig queries directly against Cassandra. DataStax offer this functionality in their Enterprise product. http://www.datastax.com/products/enterprise
52. Hive: SQL-like interface to Hadoop.
CREATE EXTERNAL TABLE tempUsers (userUuid string, segmentId string, value string)
STORED BY 'org.apache.hadoop.hive.cassandra.CassandraStorageHandler'
WITH SERDEPROPERTIES ("cassandra.columns.mapping" = ":key,:column,:value", "cassandra.cf.name" = "users");
SELECT segmentId, count(1) AS total FROM tempUsers GROUP BY segmentId ORDER BY total DESC;
53. In conclusion: Cassandra is founded on sound design principles.
54. In conclusion: The data model is incredibly powerful.
55. In conclusion: CQL and a new breed of clients are making it easier to use.
56. In conclusion: Hadoop integration means we can analyse data directly from a Cassandra cluster.
57. In conclusion: There is a strong community and multiple companies offering professional support.
58. Thanks. Looking for a job? Learn more about Cassandra: meetup.com/Cassandra-London. Sample ad-targeting project on GitHub: https://github.com/davegardnerisme/we-have-your-kidneys. Watch videos from Cassandra SF 2011: http://www.datastax.com/events/cassandrasf2011/presentations
</p>
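
The slides on consistency levels, conflict resolution and column ordering can be tied together with a small in-memory model. The following is a minimal Python sketch and not Cassandra code: the names `quorum`, `reconcile` and `column_slice`, and the representation of a row as a dictionary of `(value, timestamp)` pairs, are assumptions made for illustration. It also simplifies tie-breaking (on equal timestamps the first version is kept here).

```python
from typing import Dict, List, Tuple

# Hypothetical model: a column is (value, timestamp),
# and a row maps column name -> column.
Column = Tuple[str, int]
Row = Dict[str, Column]


def quorum(replication_factor: int) -> int:
    """Replicas that must respond at QUORUM: N/2 + 1, using integer division."""
    return replication_factor // 2 + 1


def reconcile(a: Row, b: Row) -> Row:
    """Merge two versions of a row, keeping the bigger timestamp per column."""
    merged: Row = dict(a)
    for name, column in b.items():
        # Simplification: on equal timestamps the version from `a` is kept.
        if name not in merged or column[1] > merged[name][1]:
            merged[name] = column
    return merged


def column_slice(row: Row, start: str, end: str) -> List[Tuple[str, str]]:
    """Return (name, value) pairs with start <= name <= end, in name order."""
    return [(name, row[name][0]) for name in sorted(row) if start <= name <= end]


if __name__ == "__main__":
    replica_1: Row = {"foo": ("bar", 1000)}
    replica_2: Row = {"foo": ("zing", 1001)}
    print(quorum(3))                        # replicas needed when RF = 3
    print(reconcile(replica_1, replica_2))  # the timestamp-1001 version survives

    ad_row: Row = {
        "2011103017-click": ("2", 1),
        "2011103015-click": ("1", 1),
        "2011103016-click": ("12", 1),
    }
    # Columns come back sorted by name, so an hourly range is one contiguous slice.
    print(column_slice(ad_row, "2011103015-click", "2011103016-click"))
```

Because columns are kept in name order, the hourly counter names from pattern 2 sort chronologically, which is what makes the "between column X and Y" fetch a single contiguous read in this model.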