1
CS 525 Advanced Distributed Systems
Spring 2014
Indranil Gupta (Indy)
Feb 11, 2014
Key-value/NoSQL Stores
Lecture 7
2014, I. Gupta
Based mostly on
•Cassandra NoSQL presentation
•Cassandra 1.0 documentation at datastax.com
•Cassandra Apache project wiki
•HBase
2
Cassandra
• Originally designed at Facebook
• Open-sourced
• Some of its myriad users:
• With this many users, one would think
– Its design is very complex
– Let’s find out!
3
Why Key-value Store?
• (Business) Key -> Value
• (twitter.com) tweet id -> information about tweet
• (amazon.com) item number -> information about it
• (kayak.com) Flight number -> information about flight, e.g., availability
• (yourbank.com) Account number -> information about it
• Search is usually built on top of a key-value store
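To make the key -> value idea concrete, here is a minimal Python sketch (not part of the slides); the class name, keys, and values are illustrative assumptions only.

# Minimal key-value store interface (illustrative sketch, not Cassandra code).
class KVStore:
    def __init__(self):
        self._data = {}          # key -> value

    def put(self, key, value):
        self._data[key] = value  # e.g., put(tweet_id, tweet_info)

    def get(self, key):
        return self._data.get(key)  # e.g., get(flight_number) -> flight info

store = KVStore()
store.put("flight:UA123", {"seats_left": 4})
print(store.get("flight:UA123"))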
4
Isn’t that just a database?
• Yes, sort of
• Relational Databases (RDBMSs) have been around for ages
• MySQL is the most popular among them
• Data stored in tables
• Schema-based, i.e., structured tables
• Queried using SQL (Structured Query Language)
SQL query: SELECT user_id FROM users WHERE username = 'jbellis'
5
Mismatch with today’s workloads
• Data: Large and unstructured
• Lots of random reads and writes
• Foreign keys rarely needed
• Need
– Speed
– Avoid Single point of Failure (SPoF)
– Low TCO (Total Cost of Ownership) and fewer sysadmins
– Incremental Scalability
– Scale out, not up: use more machines that are off the shelf (COTS), not more powerful machines
6
CAP Theorem
• Proposed by Eric Brewer (Berkeley)
• Subsequently proved by Gilbert and Lynch
• In a distributed system you can satisfy at most 2 out of the 3 guarantees:
1. Consistency: all nodes see same data at any time, or reads return latest written value
2. Availability: the system allows operations all the time
3. Partition-tolerance: the system continues to work in spite of network partitions
• Cassandra
– Eventual (weak) consistency, Availability, Partition-tolerance
• Traditional RDBMSs
– Strong consistency over availability under a partition
7
CAP Tradeoff
• Starting point for NoSQL Revolution
• Conjectured by Eric Brewer in 2000, proved by Gilbert and Lynch in 2002
• A distributed storage system can achieve at most two of C, A, and P.
• When partition-tolerance is important, you have to choose between consistency and availability
[Figure: CAP triangle with Consistency, Availability, and Partition-tolerance at the corners. RDBMSs sit on the Consistency–Availability side; Cassandra, Riak, Dynamo, and Voldemort on the Availability–Partition-tolerance side; HBase, HyperTable, BigTable, and Spanner on the Consistency–Partition-tolerance side.]
8
Eventual Consistency
• If all writers stop (to a key), then all its values (replicas) will converge eventually.
• If writes continue, then system always tries to keep converging.
– Moving “wave” of updated values lagging behind the latest values sent by clients, but always trying to catch up.
• May return stale values to clients (e.g., if many back-to-back writes).
• But works well when there are periods of low writes. The system then converges quickly.
9
Cassandra – Data Model
• Column Families:
– Like SQL tables
– but may be unstructured (client-specified)
– Can have index tables
• “Column-oriented databases”/ “NoSQL”
– Columns stored together, rather than rows
– No schemas
» Some columns missing from some entries
– NoSQL = “Not Only SQL”
– Supports get(key) and put(key, value) operations
– Often write-heavy workloads
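A minimal Python sketch (illustrative, not Cassandra's API) of a schemaless column-family store, where each row may carry a different set of columns; all names and values below are made up.

# Sketch of a schemaless column-family store: each row can have a
# different set of columns (hypothetical names, not Cassandra's API).
from collections import defaultdict

column_family = defaultdict(dict)   # row key -> {column name -> value}

def put(key, column, value):
    column_family[key][column] = value

def get(key):
    return column_family.get(key)    # returns only the columns this row has

put("user1", "name", "Alice")
put("user1", "city", "Urbana")
put("user2", "name", "Bob")          # no "city" column for this row
print(get("user2"))                  # {'name': 'Bob'}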
10
Let’s go Inside Cassandra: Key -> Server Mapping
• How do you decide which server(s) a key-value resides on?
11
Cassandra uses a Ring-based DHT, but without routing. The key->server mapping is the “Partitioning Function”.
[Figure: ring identifier space with m=7 (positions 0..127) and nodes N16, N32, N45, N80, N96, N112. A read/write of key K13 goes through the Coordinator (typically one per DC) to the primary replica for key K13 and to its backup replicas. (Remember this?)]
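A small Python sketch (an illustrative assumption, not Cassandra code) of such a ring-based partitioning function, using the node positions and the m=7 identifier space from the figure.

import hashlib

# Ring-based partitioning sketch (m = 7 bits, as in the figure; illustrative only).
M = 7
RING = sorted([16, 32, 45, 80, 96, 112])   # node positions on the 0..127 ring

def ring_position(key):
    # Hash the key into the 2^m identifier space.
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** M)

def replicas_for(key, n_replicas=3):
    pos = ring_position(key)
    # Primary = first node clockwise from the key; backups = next nodes clockwise.
    start = next((i for i, node in enumerate(RING) if node >= pos), 0)
    return [RING[(start + i) % len(RING)] for i in range(n_replicas)]

print(replicas_for("K13"))   # primary and backup replica positions for this key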
12
Writes
• Need to be lock-free and fast (no reads or disk seeks)
• Client sends write to one front-end node in Cassandra cluster
– Front-end = Coordinator, assigned per key
• Which (via Partitioning function) sends it to all replica nodes responsible for key
– Always writable: Hinted Handoff mechanism
» If any replica is down, the coordinator writes to all other replicas, and keeps the write locally until down replica comes back up.
» When all replicas are down, the Coordinator (front end) buffers writes (for up to a few hours).
– Provides Atomicity for a given key (i.e., within a ColumnFamily)
• One ring per datacenter
– Per-DC coordinator elected to coordinate with other DCs
– Election done via Zookeeper, which runs a Paxos variant
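A rough Python sketch (illustrative only, not Cassandra source) of the coordinator's write path with hinted handoff; the helper callbacks is_up and send are hypothetical.

# Sketch of the coordinator's write path with hinted handoff: write to every
# live replica, keep a "hint" locally for each replica that is down, and
# replay it when that replica recovers.
hints = []   # (down_replica, key, value) kept at the coordinator

def coordinator_write(key, value, replicas, is_up, send):
    wrote_somewhere = False
    for r in replicas:
        if is_up(r):
            send(r, key, value)           # normal replica write
            wrote_somewhere = True
        else:
            hints.append((r, key, value)) # hinted handoff: buffer for later
    if not wrote_somewhere:
        pass  # all replicas down: coordinator buffers writes (for up to a few hours)

def replay_hints(is_up, send):
    global hints
    remaining = []
    for r, key, value in hints:
        if is_up(r):
            send(r, key, value)           # deliver the hint once the replica is back
        else:
            remaining.append((r, key, value))
    hints = remaining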
13
Writes at a replica node
On receiving a write
•1. Log it in the disk commit log
•2. Make changes to appropriate memtables
– In-memory representation of multiple key-value pairs
•Later, when a memtable is full or old, flush to disk
– Data File: An SSTable (Sorted String Table) – list of key-value pairs, sorted by key
– Index file: An SSTable of (key, position in data SSTable) pairs
– And a Bloom filter (for efficient search) – next slide
•Compaction: Data updates accumulate over time, and SSTables and logs need to be compacted
– Merge SSTables, e.g., by merging updates for a key
– Run periodically and locally at each server
•Reads need to touch the log and multiple SSTables
– A row may be split over multiple SSTables
– Reads may be slower than writes
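A compact Python sketch (illustrative) of this per-replica write path: append to the commit log, update the memtable, and flush a sorted SSTable when the memtable fills. The flush threshold is an arbitrary assumption.

# Sketch of a replica's write path: append to the commit log, update the
# in-memory memtable, and flush it as a sorted SSTable when full.
commit_log = []          # append-only log on disk (simulated as a list)
memtable = {}            # in-memory key -> value
sstables = []            # each flushed SSTable is a list of (key, value), sorted by key
MEMTABLE_LIMIT = 4

def write(key, value):
    commit_log.append((key, value))   # 1. log it in the commit log
    memtable[key] = value             # 2. apply to the memtable
    if len(memtable) >= MEMTABLE_LIMIT:
        flush_memtable()

def flush_memtable():
    global memtable
    sstables.append(sorted(memtable.items()))  # Sorted String Table: sorted by key
    memtable = {}

def read(key):
    if key in memtable:
        return memtable[key]
    # A row may be split over multiple SSTables: scan newest to oldest.
    for table in reversed(sstables):
        for k, v in table:
            if k == key:
                return v
    return None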
14
Bloom Filter
• Compact way of representing a set of items
• Checking for existence in set is cheap
• Some probability of false positives: an item not in set may check true as being in set
• Never false negatives
[Figure: a large bit map (here 128 bits, positions 0..127); key K is fed through Hash1, Hash2, ..., Hashk, each selecting a bit position to set.]
• On insert, set all hashed bits.
• On check-if-present, return true if all hashed bits set.
• False positive rate is low, e.g., with k=4 hash functions, 100 items, and 3200 bits, the FP rate = 0.02%.
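A minimal Python Bloom filter sketch (illustrative, not Cassandra's implementation), sized roughly like the slide's example.

import hashlib

# Minimal Bloom filter matching the slide's parameters (k = 4 hash functions,
# 3200 bits); illustrative only.
M_BITS = 3200
K_HASHES = 4
bits = [0] * M_BITS

def _positions(item):
    # Derive k bit positions from k salted hashes of the item.
    return [int(hashlib.md5(f"{i}:{item}".encode()).hexdigest(), 16) % M_BITS
            for i in range(K_HASHES)]

def insert(item):
    for p in _positions(item):
        bits[p] = 1                      # on insert, set all hashed bits

def maybe_contains(item):
    # True if all hashed bits are set: no false negatives, small false-positive rate.
    return all(bits[p] for p in _positions(item))

insert("key-K")
print(maybe_contains("key-K"))      # True
print(maybe_contains("other-key"))  # almost certainly False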
15
Deletes and Reads
• Delete: don’t delete item right away
– Add a tombstone to the log
– Compaction will eventually remove tombstone and delete item
• Read: Similar to writes, except
– Coordinator contacts a number of replicas (e.g., in same rack) specified by the consistency level
» Forwards read to replicas that have responded quickest in the past
» Returns value with the latest timestamp
– Coordinator also fetches value from multiple replicas
» Checks consistency in the background, initiating a read-repair if any two values are different
» Brings all replicas up to date
– A row may be split across multiple SSTables => reads need to touch multiple SSTables => reads slower than writes (but still fast)
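A small Python sketch (illustrative, not Cassandra code) of delete-as-tombstone: deletes append a timestamped tombstone, reads return the latest-timestamped entry, and a toy compaction drops tombstoned keys for good.

import time

# Sketch of delete-as-tombstone: a delete appends a tombstone with a timestamp,
# reads return the latest-timestamped entry, and compaction eventually removes
# tombstoned keys.
TOMBSTONE = object()
entries = []   # (key, value_or_tombstone, timestamp), spread across log/SSTables

def put(key, value):
    entries.append((key, value, time.time()))

def delete(key):
    entries.append((key, TOMBSTONE, time.time()))   # don't delete right away

def read(key):
    versions = [(ts, v) for k, v, ts in entries if k == key]
    if not versions:
        return None
    _, latest = max(versions, key=lambda tv: tv[0])  # latest timestamp wins
    return None if latest is TOMBSTONE else latest

def compact():
    global entries
    newest = {}
    for k, v, ts in entries:
        if k not in newest or ts > newest[k][1]:
            newest[k] = (v, ts)
    # Keep only live values; tombstones (and what they shadowed) disappear.
    entries = [(k, v, ts) for k, (v, ts) in newest.items() if v is not TOMBSTONE]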
16
Cassandra uses Quorums
• Quorum = way of selecting sets so that any pair of sets intersect
– E.g., any arbitrary set with at least Q = N/2 + 1 nodes
– Where N = total number of replicas for this key
• Reads
– Wait for R replicas (R specified by clients)
– In the background, check for consistency of remaining N-R replicas, and initiate read repair if needed
• Writes come in two default flavors
– Block until quorum is reached
– Async: Write to any node
• R = read replica count, W = write replica count
• If W+R > N and W > N/2, you have consistency, i.e., each read returns the latest written value
• Reasonable: (W=1, R=N) or (W=N, R=1) or (W=Q, R=Q)
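A tiny Python sketch of the quorum arithmetic above; the function names are illustrative.

# Sketch of the quorum conditions on this slide: with N replicas, a read waits
# for R replicas and a write for W. Reads overlap the latest write when
# W + R > N; W > N/2 additionally prevents two conflicting write quorums.
def read_sees_latest_write(N, W, R):
    return W + R > N          # read set and write set must intersect

def no_conflicting_writes(N, W):
    return W > N / 2          # any two write quorums must intersect

N, Q = 3, 3 // 2 + 1                         # Q = N/2 + 1 = 2
for W, R in [(1, N), (N, 1), (Q, Q), (1, 1)]:
    print(W, R, read_sees_latest_write(N, W, R))
# (1,3), (3,1), (2,2) -> True; (1,1) -> False (stale reads possible)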
17
Cassandra Consistency Levels
• In reality, a client can choose one of these levels for a read/write operation:
– ANY: any node (may not be replica)
– ONE: at least one replica
– Similarly, TWO, THREE
– QUORUM: quorum across all replicas in all datacenters
– LOCAL_QUORUM: in coordinator’s DC
– EACH_QUORUM: quorum in every DC
– ALL: all replicas all DCs
• For Write, you also have – SERIAL: claims to implement linearizability for transactions
18
Cluster Membership – Gossip-Style
Protocol:
•Nodes periodically gossip their membership list
•On receipt, the local membership list is updated, as shown
•If any heartbeat older than Tfail, node is marked as failed
[Figure: four nodes (1, 2, 3, 4). Each membership list row is (Address, Heartbeat Counter, Time (local)). Node 1 gossips its list {1: 10120@66, 2: 10103@62, 3: 10098@63, 4: 10111@65} to node 2, whose list is {1: 10118@64, 2: 10110@64, 3: 10090@58, 4: 10111@65}. At current time 70 at node 2 (asynchronous clocks), node 2 keeps the higher heartbeat for each address and stamps its local time on entries whose heartbeat increased, giving {1: 10120@70, 2: 10110@64, 3: 10098@70, 4: 10111@65}. Fig and animation by: Dongyun Jin and Thuy Ngyuen]
Cassandra uses gossip-based cluster membership
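A short Python sketch (illustrative) of the merge step shown in the figure, reproducing node 2's update at local time 70.

# Gossip merge: for each address, keep the higher heartbeat; when an entry's
# heartbeat advances, stamp it with the receiver's current local time.
def merge(local_list, received_list, local_time):
    # Both lists map: address -> (heartbeat_counter, time_last_updated_locally)
    for addr, (hb, _) in received_list.items():
        if addr not in local_list or hb > local_list[addr][0]:
            local_list[addr] = (hb, local_time)
    return local_list

node2 = {1: (10118, 64), 2: (10110, 64), 3: (10090, 58), 4: (10111, 65)}
gossip_from_node1 = {1: (10120, 66), 2: (10103, 62), 3: (10098, 63), 4: (10111, 65)}
print(merge(node2, gossip_from_node1, local_time=70))
# -> {1: (10120, 70), 2: (10110, 64), 3: (10098, 70), 4: (10111, 65)}, as in the figure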
19
Gossip-Style Failure Detection
• If the heartbeat has not increased for more than Tfail seconds (according to local time), the member is considered failed
• But don’t delete it right away
• Wait an additional Tfail seconds, then delete the member from the list
– Why?
20
Gossip-Style Failure Detection
• What if an entry pointing to a failed process is deleted right after Tfail (= 24) seconds?
• Fix: remember for another Tfail
• Ignore gossips for failed members
– Don’t include failed members in gossip messages
[Figure: node 2’s list {1: 10120@66, 2: 10110@64, 3: 10098@50, 4: 10111@65} has entry 3’s heartbeat unchanged past Tfail, so entry 3 is deleted, leaving {1: 10120@66, 2: 10110@64, 4: 10111@65}. Node 1 then gossips a list that still carries entry 3 with the same old heartbeat {1: 10120@66, 2: 10103@62, 3: 10098@55, 4: 10111@65}. At current time 75 at process 2, node 2 would mistakenly re-insert entry 3 as 10098@75, treating the stale heartbeat as fresh.]
21
Cluster Membership, contd.
• Suspicion mechanisms to adaptively set the timeout
• Accrual detector: Failure Detector outputs a value (PHI) representing suspicion
• Apps set an appropriate threshold
• PHI = 5 => 10-15 sec detection time
• PHI calculation for a member
– Inter-arrival times for gossip messages
– PHI(t) = –log10( CDF or Probability(t_now – t_last) )
– PHI basically determines the detection timeout, but takes into account historical inter-arrival time variations for gossiped heartbeats
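A simplified Python sketch of an accrual detector; it assumes an exponential inter-arrival distribution, which is a simplification of what Cassandra actually fits.

import math

# Accrual (PHI) failure detector sketch: estimate the gossip inter-arrival
# distribution from history and output PHI = -log10(probability of seeing
# a gap at least this long).
def phi(t_now, t_last, past_intervals):
    mean = sum(past_intervals) / len(past_intervals)
    gap = t_now - t_last
    # Assume exponential inter-arrival times: P(gap' >= gap) = exp(-gap/mean).
    p_longer = math.exp(-gap / mean)
    return -math.log10(max(p_longer, 1e-12))

history = [1.0, 1.2, 0.9, 1.1]        # seconds between recent heartbeats
print(phi(t_now=10.0, t_last=9.0, past_intervals=history))   # small PHI: alive
print(phi(t_now=21.0, t_last=9.0, past_intervals=history))   # large PHI: suspect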
22
Data Placement Strategies
• Replication Strategy: two options:
1. SimpleStrategy
2. NetworkTopologyStrategy
1. SimpleStrategy: uses the Partitioner
1. RandomPartitioner: Chord-like hash partitioning
2. ByteOrderedPartitioner: Assigns ranges of keys to servers.
» Easier for range queries (e.g., Get me all twitter users starting with [a-b])
2. NetworkTopologyStrategy: for multi-DC deployments
– Two replicas per DC: allows a consistency level of ONE
– Three replicas per DC: allows a consistency level of LOCAL_QUORUM
– Per DC
» First replica placed according to Partitioner
» Then go clockwise around ring until you hit different rack
23
Snitches
• Maps IPs to racks and DCs. Configured in the cassandra.yaml config file
• Some options:
– SimpleSnitch: Unaware of Topology (Rack-unaware)
– RackInferring: Assumes topology of network by octet of server’s IP address
» 101.201.301.401 = x.<DC octet>.<rack octet>.<node octet>
– PropertyFileSnitch: uses a config file
– EC2Snitch: uses EC2.
» EC2 Region = DC
» Availability zone = rack
• Other snitch options available
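A toy Python sketch (illustrative, not the actual snitch code) of inferring DC and rack from the octets of a node's IP address, as a rack-inferring snitch does.

# Derive the DC and rack of a node purely from its IP octets.
def infer_topology(ip):
    octets = ip.split(".")              # x.<DC octet>.<rack octet>.<node octet>
    return {"dc": octets[1], "rack": octets[2], "node": octets[3]}

print(infer_topology("10.201.30.40"))   # {'dc': '201', 'rack': '30', 'node': '40'}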
24
Vs. SQL
• MySQL is one of the most popular (and has been for a while)
• On > 50 GB data
• MySQL – Writes 300 ms avg
– Reads 350 ms avg
• Cassandra – Writes 0.12 ms avg
– Reads 15 ms avg
25
Cassandra Summary
• While RDBMSs provide ACID (Atomicity, Consistency, Isolation, Durability)
• Cassandra provides BASE
– Basically Available Soft-state Eventual Consistency
– Prefers Availability over Consistency
• Other NoSQL products
– MongoDB, Riak (look them up!)
• Next: HBase
– Prefers (strong) Consistency over Availability
26
HBase
• Google’s BigTable was the first “blob-based” storage system
• Yahoo! open-sourced it -> HBase
• Major Apache project today
• Facebook uses HBase internally
• API
– Get/Put(row)
– Scan(row range, filter) – range queries
– MultiPut
27
HBase Architecture
Source: http://www.larsgeorge.com/2009/10/hbase-architecture-101-storage.html
Small group of servers running Zab, a consensus protocol (Paxos-like)
HDFS
28
HBase Storage hierarchy
• HBase Table
– Split it into multiple regions: replicated across servers
» One Store per combination of ColumnFamily (subset of columns with similar query patterns) + region
• Memstore for each Store: in-memory updates to Store; flushed to disk when full
– StoreFiles for each store for each region: where the data lives
- Blocks
• HFile
– SSTable from Google’s BigTable
29
HFile
Source: http://blog.cloudera.com/blog/2012/06/hbase-io-hfile-input-output/
[Figure: HFile layout for a census table example, keyed by SSN:000-01-2345, with column family Demographic and column Ethnicity]
30
Strong Consistency: HBase Write-Ahead Log
Write to HLog before writing to MemStore
Thus can recover from failure
Source: http://www.larsgeorge.com/2010/01/hbase-architecture-101-write-ahead-log.html
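A minimal Python sketch (illustrative, not HBase code) of the write-ahead-log idea: append each edit to the HLog before updating the MemStore, and replay the log on recovery.

# Write-ahead logging sketch: append to the durable HLog first, then apply to
# the in-memory MemStore; after a crash, replaying the log rebuilds the MemStore.
hlog = []        # durable, append-only log (simulated)
memstore = {}    # in-memory updates, lost on a crash

def put(row, value):
    hlog.append((row, value))   # 1. write to the HLog (durable)
    memstore[row] = value       # 2. then update the MemStore

def recover():
    memstore.clear()            # simulate losing the in-memory state
    for row, value in hlog:     # log replay: re-apply edits to the MemStore
        memstore[row] = value

put("row1", "v1")
put("row2", "v2")
recover()
print(memstore)                 # {'row1': 'v1', 'row2': 'v2'}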
31
Log Replay
• After recovery from failure, or upon bootup (HRegionServer/HMaster)
– Replay any stale logs (use timestamps to find out where the database is w.r.t. the logs)
– Replay: add edits to the MemStore
• Keeps one HLog per HRegionServer rather than per region
– Avoids many concurrent writes, which on the local file system may involve many disk seeks
32
Cross-data center replication
HLog
Zookeeper is actually a file system for control information
1. /hbase/replication/state
2. /hbase/replication/peers/<peer cluster number>
3. /hbase/replication/rs/<hlog>
33
Wrap Up
• Key-value stores and NoSQL trade off between Availability and Consistency, while providing partition-tolerance
• Presentations and reviews start this Thursday
– Presenters: please email me slides at least 24 hours before your presentation time
– Everyone: please read instructions on website
– If you have not signed up for a presentation slot, you’ll need to do all reviews.
– Project meetings starting soon.