Storage cassandra


Post on 10-May-2015






<ul><li>1.<ul><li>Cassandra </li></ul></li></ul> <p>Roc.Yang 2011.04</p>
<p>2. Contents: 1 Overview, 2 Data Model, 3 Storage Model, 4 System Architecture, 5 Read &amp; Write, 6 Other</p>
<p>3. Cassandra </p> <ul><li>Overview </li></ul>
<p>4. Cassandra From Facebook</p> <p>5. Cassandra To</p> <p>6. Cassandra From Dynamo and Bigtable </p> <ul><li>Cassandra is a highly scalable, eventually consistent, distributed, structured key-value store. Cassandra brings together the distributed systems technologies from Dynamo and the data model from Google's BigTable.</li></ul> <ul><li><ul><li>Like Dynamo, Cassandra is eventually consistent.</li></ul></li></ul> <ul><li><ul><li>Like BigTable, Cassandra provides a ColumnFamily-based data model richer than typical key/value systems.</li></ul></li></ul> <ul><li>Cassandra was open sourced by Facebook in 2008, where it was designed by Avinash Lakshman (one of the authors of Amazon's Dynamo) and Prashant Malik (Facebook engineer). In a lot of ways you can think of Cassandra as Dynamo 2.0 or a marriage of Dynamo and BigTable.</li></ul>
<p>7. Cassandra - Overview </p> <ul><li>Cassandra is a distributed storage system for managing very large amounts of structured data spread out across many commodity servers, while providing highly available service with no single point of failure. </li></ul> <ul><li>Cassandra does not support a full relational data model; instead, it provides clients with a simple data model that supports dynamic control over data layout and format. </li></ul>
<p>8. Cassandra - Highlights </p> <ul><li>High availability </li></ul> <ul><li>Incremental scalability </li></ul> <ul><li>Eventually consistent </li></ul> <ul><li>Tunable tradeoffs between consistency and latency </li></ul> <ul><li>Minimal administration </li></ul> <ul><li>No SPOF (single point of failure) </li></ul>
<p>9. 
Cassandra Trade Offs </p> <ul><li>No transactions </li></ul> <ul><li>No ad hoc queries </li></ul> <ul><li>No joins </li></ul> <ul><li>No flexible indexes </li></ul> <ul><li>Data Modeling with Cassandra Column Families </li></ul>
<p>10. Cassandra From Dynamo and BigTable </p> <ul><li>Introduction to Cassandra: Replication and Consistency</li></ul>
<p>11. Dynamo-like Features </p> <ul><li>Symmetric, P2P architecture</li></ul> <ul><li><ul><li>No special nodes/SPOFs </li></ul></li></ul> <ul><li>Gossip-based cluster management </li></ul> <ul><li>Distributed hash table for data placement </li></ul> <ul><li><ul><li>Pluggable partitioning </li></ul></li></ul> <ul><li><ul><li>Pluggable topology discovery </li></ul></li></ul> <ul><li><ul><li>Pluggable placement strategies </li></ul></li></ul> <ul><li>Tunable, eventual consistency </li></ul> <ul><li>Data Modeling with Cassandra Column Families </li></ul>
<p>12. BigTable-like Features </p> <ul><li>Sparse columnar data model </li></ul> <ul><li><ul><li>Optional, 2-level maps called Super Column Families </li></ul></li></ul> <ul><li>SSTable disk storage </li></ul> <ul><li><ul><li>Append-only commit log </li></ul></li></ul> <ul><li><ul><li>Memtable (buffer and sort) </li></ul></li></ul> <ul><li><ul><li>Immutable SSTable files </li></ul></li></ul> <ul><li>Hadoop integration </li></ul> <ul><li>Data Modeling with Cassandra Column Families </li></ul>
<p>13. Brewer's CAP Theorem </p> <ul><li>CAP (Consistency, Availability, and Partition tolerance). </li></ul> <ul><li>Pick two of Consistency, Availability, Partition tolerance. </li></ul> <ul><li>Theorem: You can have at most two of these properties for any shared-data system. </li></ul>
<p>14. ACID &amp; BASE </p> <ul><li>ACID (Atomicity, Consistency, Isolation, Durability). 
</li></ul> <ul><li>BASE (Basically Available, Soft-state, Eventually consistent) </li></ul> <p>ACID: ACID and BASE: MySQL and NoSQL :http:// </p> <ul><li>ACID </li></ul> <ul><li><ul><li>Strong consistency </li></ul></li></ul> <ul><li><ul><li>Isolation </li></ul></li></ul> <ul><li><ul><li>Focus on commit </li></ul></li></ul> <ul><li><ul><li>Nested transactions </li></ul></li></ul> <ul><li><ul><li>Availability? </li></ul></li></ul> <ul><li><ul><li>Conservative (pessimistic) </li></ul></li></ul> <ul><li><ul><li>Difficult evolution (e.g. schema) </li></ul></li></ul> <ul><li>BASE </li></ul> <ul><li><ul><li>Weak consistency </li></ul></li></ul> <ul><li><ul><li>Stale data OK </li></ul></li></ul> <ul><li><ul><li>Availability first </li></ul></li></ul> <ul><li><ul><li>Best effort </li></ul></li></ul> <ul><li><ul><li>Approximate answers OK </li></ul></li></ul> <ul><li><ul><li>Aggressive (optimistic) </li></ul></li></ul> <ul><li><ul><li>Simpler! </li></ul></li></ul> <ul><li><ul><li>Faster </li></ul></li></ul> <ul><li><ul><li>Easier evolution </li></ul></li></ul>
<p>15. NoSQL </p> <ul><li>The term "NoSQL" was used in 1998 as the name for a lightweight, open source relational database that did not expose a SQL interface. Its author, Carlo Strozzi, claims that, as the NoSQL movement "departs from the relational model altogether; it should therefore have been called more appropriately 'NoREL', or something to that effect." </li></ul> <ul><li><ul><li>CAP </li></ul></li></ul> <ul><li><ul><li>BASE </li></ul></li></ul> <ul><li><ul><li>Eventual Consistency </li></ul></li></ul> <p>NoSQL: /</p>
<p>16. 
Dynamo &amp; Bigtable </p> <ul><li>Dynamo partitioning and replication </li></ul> <ul><li>Log-structured ColumnFamily data model similar to Bigtable's </li></ul> <ul><li>Bigtable: A Distributed Storage System for Structured Data, 2006 </li></ul> <ul><li>Dynamo: Amazon's Highly Available Key-value Store, 2007 </li></ul>
<p>17. Dynamo &amp; Bigtable </p> <ul><li>BigTable </li></ul> <ul><li><ul><li>Strong consistency </li></ul></li></ul> <ul><li><ul><li>Sparse map data model </li></ul></li></ul> <ul><li><ul><li>GFS, Chubby, etc. </li></ul></li></ul> <ul><li>Dynamo </li></ul> <ul><li><ul><li>O(1) distributed hash table (DHT) </li></ul></li></ul> <ul><li><ul><li>BASE (eventual consistency) </li></ul></li></ul> <ul><li><ul><li>Client-tunable consistency/availability </li></ul></li></ul>
<p>18. Dynamo &amp; Bigtable </p> <ul><li>CP </li></ul> <ul><li><ul><li>Bigtable </li></ul></li></ul> <ul><li><ul><li>Hypertable </li></ul></li></ul> <ul><li><ul><li>HBase </li></ul></li></ul> <ul><li>AP </li></ul> <ul><li><ul><li>Dynamo </li></ul></li></ul> <ul><li><ul><li>Voldemort </li></ul></li></ul> <ul><li><ul><li>Cassandra </li></ul></li></ul>
<p>19. Cassandra </p> <ul><li>Dynamo Overview </li></ul>
<p>20. Dynamo Architecture &amp; Lookup </p> <ul><li>O(1) node lookup </li></ul> <ul><li>Explicit replication </li></ul> <ul><li>Eventually consistent </li></ul>
<p>21. Dynamo </p> <ul><li>The name Dynamo refers to two systems:</li></ul> <ul><li>a highly available key-value storage system that some of Amazon's core services use to provide an always-on experience; </li></ul> <ul><li>a software dynamic optimization system that is capable of transparently improving the performance of a native instruction stream as it executes on the processor. </li></ul>
<p>22. Service-Oriented Architecture</p>
<p>23. Dynamo Techniques </p> <ul><li>Dynamo </li></ul> <p>Vector clock; hinted handoff; W, R, N quorum; Merkle tree; gossip</p>
<p>24. Dynamo Techniques Advantages </p> <ul><li>Summary of techniques used in Dynamo and their advantages </li></ul>
<p>25. 
Dynamo </p>
<p>26. Dynamo </p> <ul><li>Vector Clock </li></ul>
<p>27. Dynamo </p> <ul><li>W, R, N quorum </li></ul> <ul><li>N: the number of replicas of each data item </li></ul> <ul><li>W: the number of replicas that must acknowledge a write </li></ul> <ul><li>R: the number of replicas that must respond to a read </li></ul> <ul><li>If R+W&gt;N, every read quorum overlaps every write quorum </li></ul>
<p>28. Dynamo </p> <ul><li>Merkle tree </li></ul> <ul><li>Dynamo uses Merkle trees to detect differences between replicas </li></ul>
<p>29. Dynamo </p> <ul><li>Gossip </li></ul>
<p>30. Consistent Hashing - Dynamo </p> <ul><li>In Dynamo, each of the n servers is assigned v virtual nodes (vnodes), giving n*v vnodes in total </li></ul>
<p>31. Consistent Hashing - Dynamo</p>
<p>32. Cassandra </p> <ul><li>Bigtable Overview </li></ul>
<p>33. Bigtable</p>
<p>34. Bigtable </p> <ul><li>Tablet </li></ul> <ul><li><ul><li>Bigtable splits a table into tablets of roughly 100 to 200 MB each </li></ul></li></ul> <ul><li>Column Families </li></ul> <ul><li><ul><li>The basic unit of access control </li></ul></li></ul> <ul><li><ul><li>All data stored in a column family is usually of the same type (we compress data in the same column family together). </li></ul></li></ul> <ul><li>Timestamp </li></ul> <ul><li><ul><li>Each cell in a Bigtable can contain multiple versions of the same data; these versions are indexed by timestamp. </li></ul></li></ul> <ul><li>Treats data as uninterpreted strings </li></ul>
<p>35. Bigtable: Data Model </p> <ul><li>&lt;row, column, timestamp&gt; triple for key; lookup, insert, and delete API </li></ul> <ul><li>Arbitrary columns on a row-by-row basis </li></ul> <ul><li><ul><li>Column family:qualifier. Family is heavyweight, qualifier lightweight </li></ul></li></ul> <ul><li><ul><li>Column-oriented physical store: rows are sparse! </li></ul></li></ul> <ul><li>Does not support a relational model </li></ul> <ul><li><ul><li>No table-wide integrity constraints </li></ul></li></ul> <ul><li><ul><li>No multirow transactions </li></ul></li></ul>
<p>36. Bigtable: Tablet location hierarchy </p> <ul><li>A three-level hierarchy analogous to that of a B+ tree stores tablet location information </li></ul>
<p>37. 
Bigtable: METADATA </p> <ul><li>The first level is a file stored in Chubby that contains the location of the root tablet </li></ul> <ul><li>The root tablet contains the location of all tablets in a special METADATA table </li></ul> <ul><li>The METADATA table stores the location of a tablet under a row key that is an encoding of the tablet's table identifier and its end row </li></ul> <ul><li>Each METADATA row stores approximately 1KB of data in memory </li></ul> <ul><li>The METADATA table also stores secondary information, including a log of all events pertaining to each tablet (such as when a server begins serving it). This information is helpful for debugging and performance analysis </li></ul>
<p>38. Bigtable: Tablet Representation</p>
<p>39. Bigtable: SSTable</p>
<p>40. Cassandra </p> <ul><li>Data Model </li></ul>
<p>41. Cassandra Data Model </p> <ul><li>A table in Cassandra is a distributed multidimensional map indexed by a key. The value is an object which is highly structured. </li></ul> <ul><li>Every operation under a single row key is atomic per replica, no matter how many columns are being read or written. </li></ul> <ul><li>Columns are grouped together into sets called column families (very much like what happens in the Bigtable system). Cassandra exposes two kinds of column families: Simple and Super column families. </li></ul> <ul><li>Super column families can be visualized as a column family within a column family </li></ul>
<p>42. 
Cassandra Data Model </p> <p>Figure: one row KEY with three column families.</p> <ul><li>ColumnFamily1. Name: MailList, Type: Simple, Sort: Name. Columns (Name, Value, TimeStamp): tid1 at t1, tid2 at t2, tid3 at t3, tid4 at t4 </li></ul> <ul><li>ColumnFamily2. Name: WordList, Type: Super, Sort: Time. Super column aloha holds columns (C1, V1, T1), (C2, V2, T2), (C3, V3, T3), (C4, V4, T4); super column dude holds columns (C2, V2, T2), (C6, V6, T6) </li></ul> <ul><li>ColumnFamily3. Name: System, Type: Super, Sort: Name. Super columns: hint1, hint2, hint3, hint4 </li></ul> <ul><li>Column families are declared upfront </li></ul> <ul><li>SuperColumns are added and modified dynamically </li></ul> <ul><li>Columns are added and modified dynamically </li></ul>
<p>43. Cassandra Data Model </p> <ul><li>Keyspace </li></ul> <ul><li><ul><li>Uppermost namespace </li></ul></li></ul> <ul><li><ul><li>Typically one per application </li></ul></li></ul> <ul><li><ul><li>~= database </li></ul></li></ul> <ul><li>ColumnFamily </li></ul> <ul><li><ul><li>Associates records of a similar kind </li></ul></li></ul> <ul><li><ul><li>Not the same kind, because CFs are sparse tables </li></ul></li></ul> <ul><li><ul><li>Record-level atomicity </li></ul></li></ul> <ul><li><ul><li>Indexed </li></ul></li></ul> <ul><li>Row </li></ul> <ul><li><ul><li>Each row is uniquely identifiable by key </li></ul></li></ul> <ul><li><ul><li>Rows group columns and super columns </li></ul></li></ul> <ul><li>Column </li></ul> <ul><li><ul><li>Basic unit of storage </li></ul></li></ul>
<p>44. Cassandra Data Model</p>
<p>45. Cassandra Data Model (an example)</p>
<p>46. Cassandra Data Model</p>
<p>47. Cassandra Data Model</p>
<p>48. Cassandra Data Model - Cluster. Cluster</p>
<p>49. Cassandra Data Model - Cluster. Cluster &gt; Keyspace. Partitioners: OrderPreservingPartitioner, RandomPartitioner. Like an RDBMS schema: one keyspace per application</p>
<p>50. Cassandra Data Model. Cluster &gt; Keyspace &gt; Column Family. Like an RDBMS table: separates types in an app</p>
<p>51. Cassandra Data Model. SortedMap ... Cluster &gt; Keyspace &gt; Column Family &gt; Row</p>
<p>52. 
Cassandra Data Model. Cluster &gt; Keyspace &gt; Column Family &gt; Row &gt; Column. Name -&gt; Value (byte[] -&gt; byte[]), plus a version timestamp. Not like an RDBMS column: a column is an attribute of the row, and each row can contain millions of different columns</p>
<p>53. Cassandra Data Model </p> <ul><li>Any column within a column family is accessed using the convention:</li></ul> <ul><li>column family : column </li></ul> <ul><li>Any column within a column family that is of type super is accessed using the convention:</li></ul> <ul><li>column family : super column : column </li></ul>
<p>54. Cassandra </p> <ul><li>Storage Model </li></ul>
<p>55. Storage Model. Figure: a write for Key (CF1, CF2, CF3) is appended to the Commit Log as a binary serialized Key (CF1, CF2, CF3) and applied to Memtable (CF1), Memtable (CF2), and Memtable (CF3). A FLUSH is triggered by: </p> <ul><li>Data size </li></ul> <ul><li>Number of objects </li></ul> <ul><li>Lifetime </li></ul> <p>The commit log sits on a dedicated disk. The data file on disk holds one serialized column family entry after another. A block index records the offset of every 128th key (K128 offset, K256 offset, K384 offset), alongside a Bloom filter; the index is kept in memory.</p>
<p>56. Storage Model - Compactions. Figure: three sorted data files, (K1, K2, K3, ...), (K2, K10, K30, ...), and (K4, K5, K10, ...), each storing serialized data per key, are merge-sorted into one sorted data file with keys K1, K2, K3, K4, K5, K10, K30. The new index file (K1 offset, K5 offset, K30 offset) and Bloom filter are loaded in memory, and the input files are deleted.</p>
<p>57. 
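Storage Model - Write </p> <ul><li>Partitioner: </li></ul> <ul><li><ul><li>RandomPartitioner (hash) </li></ul></li></ul> <ul><li><ul><li>OrderPreservingPartitioner </li></ul></li></ul> <ul><li>Owner node: memtable (MemTable) </li></ul> <ul><li>Commit log (Commit Log) </li></ul>
<p>58. Storage Model - Write </p> <ul><li>Write-through cache </li></ul> <ul><li>Append </li></ul> <ul><li>ColumnFamily </li></ul> <ul><li>Hinted handoff </li></ul>
<p>59. Storage Model - Read </p> <ul><li>R </li></ul> <ul><li>N - R: read repair </li></ul> <ul><li>SSTable </li></ul> <ul><li><ul><li>BloomFilter: SSTable </li></ul></li></ul> <ul><li><ul><li>Key/Column index: SSTable key, column </li></ul></li></ul>
<p>60. Cassandra Storage. Cassandra's storage mechanism borrows from Bigtable's design and is built on memtables and SSTables. Before writing data, Cassandra first records the write in a log, the commitlog. The data is then written to the memtable of the corresponding column family, where entries are kept sorted by key. A memtable is an in-memory structure that is flushed to disk in bulk as an SSTable. This works like a write-back cache: it turns random I/O into sequential I/O. Once written, an SSTable is immutable, so the next memtable flush goes to a new SSTable; from Cassandra's point of view there are only sequential writes.</p>
<p>61. Cassandra Storage. Because SSTables are immutable, the data of one column family may be spread across several SSTables, and a query may have to merge the column family's SSTables with its memtable. To find out quickly which SSTables may contain a key without reading them all, Cassandra uses a Bloom filter: several hash functions map each key into a bitmap, which is then used to test whether a key may be in an SSTable.</p>
<p>62. Cassandra Storage </p> <ul><li>To limit the cost of having many SSTables, Cassandra periodically merges several SSTables into a new SSTable. Because the keys in each SSTable are already sorted, a single merge sort is enough. Cassandra's data directory contains three kinds of files: </li></ul> <ul><li><ul><li>Column Family Name-(sequence number)-Data.db</li></ul></li></ul> <ul><li><ul><li>Column Family Name-(sequence number)-Filter.db</li></ul></li></ul> <ul><li><ul><li>Column Family Name-(sequence number)-index.db</li></ul></li></ul> <ul><li>Data.db is the SSTable data file; SSTable stands for Sorted Strings Table, which stores key/value strings sorted by key. index.db is the index file, holding the offset of each key in the data file. Filter.db is the file produced by the Bloom filter. </li></ul>
<p>63. Cassandra </p> <ul><li>System Architecture </li></ul>
<p>64. 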
System Architecture Content </p> <ul><li><ul><li>Overview </li></ul></li></ul> <ul><li><ul><li>Partitioning </li></ul></li></ul> <ul><li><ul><li>Replication </li></ul></li></ul> <ul><li><ul><li>Membership &amp; Failure Detection </li></ul></li></ul> <ul><li><ul><li>Bootstrapping </li></ul></li></ul> <ul><li><ul><li>Scaling the Cluster </li></ul></li></ul> <ul><li><ul><li>Local Persistence </li></ul></li></ul> <ul><li><ul><li>Communication </li></ul></li></ul>
<p>65. System Architecture. Figure components: Tombstones, Hinted handoff, Read repair, Bootstrap, Monitoring, Admin tools, Commit log, Memtable, SSTable, Indexes, Compaction, Messaging service, Gossip, Failure detection, Cluster state, Partitioner, Replication. Layers: Top Layer, Middle Layer, Core Layer</p>
<p>66. System Architecture </p> <ul><li>Core Layer</li></ul> <ul><li>Middle Layer</li></ul> <ul><li>Top Layer</li></ul> <ul><li>Above the top layer</li></ul>
<p>67. System Architecture </p> <ul><li>Core Layer:</li></ul> <ul><li>Messaging service (async, non-blocking)</li></ul> <ul><li>Gossip failure detector</li></ul> <ul><li>Cluster membership/state</li></ul> <ul><li>Partitioner (partitioning scheme) </li></ul> <ul><li>Replication strategy</li></ul>
<p>68. System Architecture </p> <ul><li>Middle Layer:</li></ul> <ul><li>Commit log</li></ul> <ul><li>Memory-table</li></ul> <ul><li>Compactions</li></ul> <ul><li>Hinted handoff</li></ul> <ul><li>Read repair</li></ul> <ul><li>Bootstrap </li></ul>
<p>69. System Architecture </p> <ul><li>Top Layer:</li></ul> <ul><li>Key, block, &amp; column indexes</li></ul> <ul><li>Read consistency</li></ul> <ul><li>Touch cache</li></ul> <ul><li>Cassandra API</li></ul> <ul><li>Admin API</li></ul>
<p>70. System Architecture </p> <ul><li>Above the top layer:</li></ul> <ul><li>Tools</li></ul> <ul><li>Hadoop integration</li></ul> <ul><li>Search API and routing</li></ul>
<p>71. 
System Architecture. Figure components: Messaging Layer, Cluster Membership, Failure Detector, Storage Layer, Partitioner, Replicator, Cassandra API, Tools</p>
<p>72. Cassandra - Architecture</p>
<p>73. System Architecture </p> <ul><li>The architecture of a storage system needs to provide: </li></ul> <ul><li>scalable and robust solutions for load balancing </li></ul> <ul><li>membership and failure detection </li></ul> <ul><li>failure recovery </li></ul> <ul><li>replica synchronization </li></ul> <ul><li>overload handling </li></ul> <ul><li>state transfer </li></ul> <ul><li>concurrency and job scheduling </li></ul> <ul><li>request marshalling </li></ul> <ul><li>request routing </li></ul> <ul><li>system monitoring and alarming </li></ul> <ul><li>configuration management </li></ul>
<p>74. System Architecture </p> <ul><li>We will focus on the core distributed systems techniques used in Cassandra:</li></ul> <ul><li>partitioning </li></ul> <ul><li>replication </li