Summary of "Amazon's Dynamo" for the 2nd NoSQL Summer Reading in Tokyo
TRANSCRIPT
Dynamo: Amazon’s Highly Available Key-Value Store by DeCandia et al. (2007)
Gemini Mobile Technologies, Inc.
NOSQL Tokyo Reading Group
(http://nosqlsummer.org/city/tokyo)
August 4, 2010
Tags: #dynamo #amazon #nosql
Dynamo: Amazon’s Highly Available Key-value Store
Authors: Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan
Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter
Vosshall and Werner Vogels.
Abstract: This paper presents the design and implementation of Dynamo, a highly
available key-value storage system that some of Amazon’s core services use to
provide an “always-on” experience. To achieve this level of availability, Dynamo
sacrifices consistency under certain failure scenarios. It makes extensive use of
object versioning and application-assisted conflict resolution in a manner that
provides a novel interface for developers to use.
Appeared in: Proceedings of the 21st ACM SIGOPS Symposium on Operating Systems
Principles, Stevenson, WA, October 2007.
http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf
Service-Oriented Architecture of Amazon’s Platform
• “highly decentralized, loosely coupled,
service-oriented architecture
consisting of hundreds of services”
• High scale: data volume and request rate.
• Availability is the top priority.
• Low-latency requirement.
• The data store is used primarily for state management.
• Service level agreements (SLAs), e.g., under 300 ms for 99.9% of requests at a peak load of 500 requests/s.
• Example: managing shopping carts. Writes and reads must remain available across multiple data centers.
Dynamo Key Features
• Highly available: 99.9995% of requests receive a successful response.
• “Always writeable”: accepts write requests even during failure scenarios.
• “Eventually consistent”: trades strong consistency for higher availability and lower latency.
• Scale-out: incrementally scalable to hundreds of nodes.
• Reliable: no data loss.
• Quorum: configurable N (number of replicas), R (replicas to read from), W (replicas to write to).
• Key-value data model.
• Symmetric nodes.
• No special functions or “master” nodes.
• “Full membership” model: each node is aware of the data hosted by its peers.
API: Two operations
• get(key)
• Locates the replica nodes associated with key.
• Returns the object, or a list of objects with conflicting versions, together with a context.
• put(key, context, object)
• Locates the replica nodes associated with key.
• Writes the object and its context to those replicas.
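A toy, single-node stand-in showing the call shapes of the two operations (class and key names here are illustrative; real Dynamo locates N replicas per key):

    class ToyDynamo:
        """Toy in-memory stand-in for Dynamo's two-operation API."""

        def __init__(self):
            self._store = {}  # key -> (context, object)

        def get(self, key):
            # Real Dynamo returns either one object or a list of
            # conflicting versions, plus an opaque version context.
            context, obj = self._store.get(key, (None, None))
            return obj, context

        def put(self, key, context, obj):
            # `context` is the version metadata returned by a prior get();
            # Dynamo uses it to place the write in the version history.
            self._store[key] = (context, obj)

    db = ToyDynamo()
    obj, ctx = db.get("cart:alice")           # first read: (None, None)
    db.put("cart:alice", ctx, {"items": []})  # write back with the context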
Partitioning Algorithm: Consistent Hashing
• Each node is assigned a random position on the ring.
• Key k is hashed onto the same fixed circular space.
• The key is assigned to nodes by walking clockwise from its hash location.
• Example:
• Nodes A, B, C, D, E, F, G are assigned positions on the ring.
• Hash(k) falls between A and B.
• With 3 replicas, choose the next 3 nodes clockwise on the ring (i.e., B, C, D).
[Figure: consistent-hashing ring with nodes A–G; Hash(k) falls between A and B, and the key is assigned clockwise to nodes B, C, D.]
Consistent Hashing
• Key advantage: adding, removing, or re-allocating nodes is cheap; it affects only the keys of a node's immediate neighbors on the ring.
• Hash function choice affects key locality and load distribution.
• “Virtual nodes”: each physical node is assigned multiple points on the ring (see the sketch below).
• Higher-performance nodes are assigned more of the keyspace.
• If a node is unavailable, its load is shed to multiple alternative nodes.
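A minimal Python sketch of the ring with virtual nodes. MD5 matches the paper's choice of hash for keys; the token count and all names are illustrative:

    import hashlib
    from bisect import bisect_right

    def ring_hash(x: str) -> int:
        # Dynamo applies MD5 to the key to place it on the circular space.
        return int(hashlib.md5(x.encode()).hexdigest(), 16)

    class Ring:
        """Consistent-hash ring with virtual nodes (multiple tokens per node)."""

        def __init__(self, nodes, tokens_per_node=8):  # 8 is illustrative
            self.points = sorted(
                (ring_hash(f"{node}#{i}"), node)
                for node in nodes
                for i in range(tokens_per_node)
            )

        def preference_list(self, key: str, n: int):
            # Walk clockwise from Hash(key), collecting n distinct nodes.
            idx = bisect_right(self.points, (ring_hash(key), "\uffff"))
            chosen = []
            while len(chosen) < n:
                node = self.points[idx % len(self.points)][1]
                if node not in chosen:
                    chosen.append(node)
                idx += 1
            return chosen

    ring = Ring(list("ABCDEFG"))
    print(ring.preference_list("k", 3))  # three distinct replica nodes

Because each physical node appears at many points, removing a node spreads its ranges over many successors instead of a single neighbor.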
Replication
• Quorum: R + W > N guarantees that the replicas read from overlap the replicas written to.
• N: number of replicas.
• R: number of replicas to read from.
• W: number of replicas to write to.
• Latency is determined by the slowest of the R replicas on a read, and of the W replicas on a write.
• Read-optimized system: R=1, W=N.
• Write-optimized system: R=N, W=1.
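A quick brute-force check of the overlap argument for one illustrative configuration (N=3, R=2, W=2):

    from itertools import combinations

    N, R, W = 3, 2, 2            # R + W > N
    replicas = range(N)
    # Every possible read set intersects every possible write set,
    # so a read quorum always contains at least one replica that
    # saw the latest successful write.
    assert all(set(rs) & set(ws)
               for rs in combinations(replicas, R)
               for ws in combinations(replicas, W))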
Data Versioning: Vector Clocks
• Eventual consistency
• A put() call may return to the caller before the update has been applied at all replicas.
• A subsequent get() may therefore return “old” data.
• Vector clocks capture the causal relationship between versions, so divergent versions can be detected and reconciled.
• A vector clock is a list of (node, counter) pairs.
• Each put() extends the version history; concurrent puts produce multiple leaf versions.
• The vector clock data (all leaves) is carried in the “context” of get() and put().
• Reconciliation of divergent versions can be done by either the client or the server.
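A minimal vector clock sketch mirroring the paper's D1–D5 example (node names Sx, Sy, Sz come from the paper; the dict representation is an assumption):

    def advance(clock, node):
        # A put() coordinated by `node` bumps that node's counter.
        new = dict(clock)
        new[node] = new.get(node, 0) + 1
        return new

    def descends(a, b):
        # True if version `a` causally follows (or equals) version `b`.
        return all(a.get(node, 0) >= count for node, count in b.items())

    d1 = advance({}, "Sx")    # {'Sx': 1}
    d2 = advance(d1, "Sx")    # {'Sx': 2}: subsumes d1
    d3 = advance(d2, "Sy")    # written via Sy during a partition
    d4 = advance(d2, "Sz")    # concurrent write via Sz
    assert descends(d2, d1) and not descends(d3, d4) and not descends(d4, d3)
    # d3 and d4 are conflicting leaves: get() returns both in the
    # context, and a later put() carries the reconciled version.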
Routing Client Requests
• Definitions
• “Preference list”: the ordered list of nodes responsible for storing a given key.
• “Coordinator” node: the first of the N replicas in a key's preference list.
• Client routing: two options
• The client keeps routing data and sends requests directly to the key's coordinator node.
• Any node can receive a client's request for any key, consult the preference list, and forward it to the key's coordinator node.
Put() and Get() operations
• Put()
• The coordinator generates the vector clock for the new version and writes the data and vector clock locally.
• The coordinator sends the put() to the N highest-ranked reachable nodes in the preference list.
• If at least W-1 of those nodes respond, the write is successful.
• Get()
• The coordinator requests all data versions from the N highest-ranked reachable nodes in the preference list.
• It waits for at least R responses.
• It returns the data to the client; this could be multiple conflicting versions.
• “Read repair”: divergent versions are reconciled, if possible, and the reconciled version is written back (see the sketch below).
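A condensed sketch of the coordinator's quorum logic, under simplifying assumptions (synchronous calls, no timeouts, no failures); `advance` and `descends` are the vector clock helpers from the earlier sketch, redefined here so the block stands alone:

    def advance(clock, node):                 # as in the vector clock sketch
        new = dict(clock)
        new[node] = new.get(node, 0) + 1
        return new

    def descends(a, b):                       # as in the vector clock sketch
        return all(a.get(n, 0) >= c for n, c in b.items())

    class Replica:
        def __init__(self, name):
            self.name, self.data = name, {}   # key -> (clock, obj)
        def store(self, key, clock, obj):
            self.data[key] = (clock, obj)
            return True                       # ack
        def fetch(self, key):
            return self.data.get(key)

    def coordinator_put(pref_list, key, context, obj, w):
        clock = advance(context or {}, pref_list[0].name)  # new version's clock
        pref_list[0].store(key, clock, obj)                # local write first
        acks = sum(1 for r in pref_list[1:] if r.store(key, clock, obj))
        return acks >= w - 1                  # W-1 acks besides the coordinator

    def coordinator_get(pref_list, key, r):
        replies = [node.fetch(key) for node in pref_list[:r]]
        versions = {}                         # dedupe identical versions
        for reply in replies:
            if reply is not None:
                clock, obj = reply
                versions[str(sorted(clock.items()))] = (clock, obj)
        # Keep only causally-latest versions; concurrent leaves all survive
        # and go back to the client for reconciliation / read repair.
        return [(c, o) for c, o in versions.values()
                if not any(descends(c2, c) and c2 != c
                           for c2, _ in versions.values())]

    nodes = [Replica(x) for x in "BCD"]
    coordinator_put(nodes, "k", None, "v1", w=2)
    print(coordinator_get(nodes, "k", r=2))   # [({'B': 1}, 'v1')]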
Handling Temporary Failures: Sloppy Quorum & Hinted Handoff
• Sloppy quorum (N, R, W)
• All operations are performed on the first N healthy nodes in the preference list, which may not be the first N nodes on the ring.
• Hinted handoff
• If a replica node is unavailable, the put() request is sent to and written at another node. This copy is called a “hinted replica”.
• The hinted replica's context metadata records which node was the intended recipient.
• A background job detects when the original node has recovered and writes the hinted replica back to it (see the sketch below).
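A toy model of sloppy quorum plus hinted handoff; dict-backed stores and all names are illustrative:

    from collections import defaultdict

    stores = defaultdict(dict)    # node -> {key: value}
    hints = defaultdict(list)     # node -> [(intended_owner, key, value)]

    def sloppy_put(pref_list, healthy, n, key, value):
        # Write to the first n healthy nodes; a stand-in node records a
        # hint naming the unreachable owner it is covering for.
        spares = iter(node for node in pref_list[n:] if node in healthy)
        for owner in pref_list[:n]:
            if owner in healthy:
                stores[owner][key] = value
            else:
                stand_in = next(spares)
                stores[stand_in][key] = value
                hints[stand_in].append((owner, key, value))

    def handoff(recovered):
        # Background job: push hinted replicas back to recovered owners.
        for node, pending in hints.items():
            for hint in list(pending):
                owner, key, value = hint
                if owner in recovered:
                    stores[owner][key] = value
                    pending.remove(hint)

    sloppy_put(["B", "C", "D", "E"], healthy={"B", "C", "E"}, n=3,
               key="k", value="v")
    print(hints["E"])    # [('D', 'k', 'v')]: E covers for unreachable D
    handoff({"D"})
    print(stores["D"])   # {'k': 'v'}: replica handed back to D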
Handling permanent failures: Replica synchronization
• An “anti-entropy” protocol keeps the replicas synchronized.
• Merkle trees
• Leaves are hashes of the values of individual keys.
• Parent nodes are hashes of their respective children.
• Each node keeps a Merkle tree per key range it hosts, so two replicas can cheaply determine whether they hold the same data.
• If the root hashes of two trees are equal, the keys/values are equal.
• Otherwise, compare subtrees to narrow down the out-of-sync data (see the sketch below).
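A small Merkle-tree comparison sketch (SHA-1 is an arbitrary choice here; the paper does not name the hash function):

    import hashlib

    def digest(b: bytes) -> bytes:
        return hashlib.sha1(b).digest()

    def merkle_levels(values):
        # level[0] holds leaf hashes (one per key's value); each parent
        # hashes the concatenation of its children; level[-1] is the root.
        level = [digest(v) for v in values]
        levels = [level]
        while len(level) > 1:
            level = [digest(b"".join(level[i:i + 2]))
                     for i in range(0, len(level), 2)]
            levels.append(level)
        return levels

    a = merkle_levels([b"v1", b"v2", b"v3", b"v4"])
    b = merkle_levels([b"v1", b"vX", b"v3", b"v4"])
    print(a[-1] == b[-1])   # False: roots differ, replicas are out of sync
    # Descend only into differing subtrees; here leaf 1 is the culprit:
    print([i for i, (x, y) in enumerate(zip(a[0], b[0])) if x != y])  # [1]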
Node Membership and Failure Detection
• Membership
• A gossip-based protocol propagates node additions/removals around the ring. Each node contacts a random peer every second, and the two exchange data on partitioning and placement (i.e., preference lists).
• “Seeds” are a configurable subset of nodes known to all nodes. Eventually every node gossips with a seed node, which reduces the chance of a logically partitioned ring.
• Local failure detection
• To identify which nodes are unreachable, each node maintains its own list of available nodes. No globally consistent view is needed.
• If a node does not respond to a message, it is marked as unreachable.
• The node periodically retries its unreachable peers.
• Additionally, explicit node join/leave commands are used (a toy gossip round is sketched below).
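A toy gossip round. The once-per-second cadence is from the paper; representing a membership view as a member → version map is an assumption:

    import random

    def merge(a, b):
        # Keep the higher version per member; in Dynamo the exchanged
        # state is partitioning/placement data, versioned similarly.
        return {m: max(a.get(m, 0), b.get(m, 0)) for m in a.keys() | b.keys()}

    def gossip_round(views):
        # Each node contacts one random peer per round (once per second
        # in Dynamo) and the pair reconcile their membership views.
        for node in list(views):
            peer = random.choice([p for p in views if p != node])
            views[node] = views[peer] = merge(views[node], views[peer])

    # Only A has heard that C joined; a few rounds spread the news.
    views = {"A": {"A": 1, "B": 1, "C": 1},
             "B": {"A": 1, "B": 1},
             "C": {"A": 1, "C": 1}}
    for _ in range(5):
        gossip_round(views)
    print(views["B"])   # now includes C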
Implementation
• 3 main software components:
• Request coordination.
• SEDA (staged event-driven architecture): state machines and queues.
• The coordinator for a write is chosen to be the node that responded fastest to the preceding read request.
• Membership and failure detection.
• Local persistence engine. Pluggable; options are BDB, MySQL, and an in-memory DB.
Experiences & Lessons Learned
• Business-logic-specific reconciliation
• The client application (not Dynamo) reconciles divergent data versions.
• Example: shopping cart. The client application merges the cart contents (see the sketch below).
• Timestamp-based reconciliation
• Dynamo applies “last write wins” reconciliation based on the latest (client-supplied?) timestamp.
• Example: user session information.
• High-performance read engine
• R=1, W=N.
• Example: product catalog.
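One plausible client-side merge for the shopping cart case. The union policy is illustrative; the paper notes that with this style of merge, added items are never lost, but items deleted in only one branch can resurface:

    def merge_carts(conflicting_versions):
        # Application-level reconciliation: union the items from all
        # divergent cart versions returned by get().
        merged = set()
        for cart in conflicting_versions:
            merged |= set(cart)
        return sorted(merged)

    print(merge_carts([["book", "pen"], ["book", "mug"]]))
    # ['book', 'mug', 'pen']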
Example Latency Results
• The request rate follows a daily pattern.
• Writes are synced to disk, so they are slower than reads.
• Buffering writes reduces latency variability at the 99.9th percentile.
Uniform load distribution
• Goal: balance the request load across nodes.
• Comparing 3 strategies:
1. T random tokens per node; partition by token value.
2. T random tokens per node; equal-sized partitions.
3. Q/S tokens per node; equal-sized partitions.
• The advantage of #2 and #3 is that they decouple data partitioning from data placement.
• Data is partitioned according to fixed, equal-sized ranges; tokens determine only placement (see the sketch below).
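A sketch of strategy 3 under made-up parameters: the hash space is pre-divided into Q equal partitions (modeled here as residues mod Q), and each of the S nodes is assigned Q/S of them, so partitioning is fixed and only placement changes:

    import random

    Q, S = 12, 4                          # illustrative: Q partitions, S nodes
    nodes = [f"n{i}" for i in range(S)]
    partitions = list(range(Q))
    random.shuffle(partitions)
    # Round-robin over the shuffled partitions gives each node Q/S of them.
    placement = {p: nodes[i % S] for i, p in enumerate(partitions)}

    def owner(key_hash: int) -> str:
        # Partitioning is fixed (equal ranges); only the placement map
        # changes when nodes join or leave.
        return placement[key_hash % Q]

    print(owner(12345))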
Bigtable vs. Dynamo Comparison
                   Bigtable                         Dynamo
    Data model     Column-oriented                  Key-value
    Storage layer  Commit log, Memtable, SSTables   BDB, cache
    Node types     Master node and tablet nodes     Symmetric nodes
    Consistency    Strong consistency               Configurable
    Transactions   Per-row                          None
    Data locality  Sorted keys; good for scans      Depends on hash function