Summary of "Amazon's Dynamo" for the 2nd NoSQL Summer Reading in Tokyo


Dynamo: Amazon’s Highly Available Key-Value Store by DeCandia et al. (2007)

Gemini Mobile Technologies, Inc.

NOSQL Tokyo Reading Group

(http://nosqlsummer.org/city/tokyo)

August 4, 2010

Tags: #dynamo #amazon #nosql


Dynamo: Amazon’s Highly Available Key-value Store

Authors: Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati, Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall, and Werner Vogels.

Abstract: This paper presents the design and implementation of Dynamo, a highly available key-value storage system that some of Amazon’s core services use to provide an “always-on” experience. To achieve this level of availability, Dynamo sacrifices consistency under certain failure scenarios. It makes extensive use of object versioning and application-assisted conflict resolution in a manner that provides a novel interface for developers to use.

Appeared in: Proceedings of the 21st ACM SIGOPS Symposium on Operating Systems Principles, Stevenson, WA, October 2007.

http://s3.amazonaws.com/AllThingsDistributed/sosp/amazon-dynamo-sosp2007.pdf


Service-Oriented Architecture of Amazon’s Platform

• “Highly decentralized, loosely coupled, service-oriented architecture consisting of hundreds of services.”

• High scale: data and requests.

• Availability is most important.

• Low-latency requirement.

• The data store is primarily for state management.

• Service level agreements (SLAs), e.g., under 300 ms for 99.9% of requests at 500 requests/s.

• Example: managing shopping carts. Carts must be writable, readable, and available across multiple data centers.


Dynamo Key Features

• Highly available: 99.9995% of requests receive successful responses.

• “Always writeable”: accepts write requests even during failure scenarios.

• “Eventually consistent”: trades off strong consistency in favor of higher availability and lower latency.

• Scale-out: incrementally scalable to hundreds of nodes.

• Reliable: no data loss.

• Quorum: configurable N (number of replicas), R (number of replicas to read), and W (number of replicas to write).

• Key-value data model.

• Symmetric nodes: no special functions or “master” nodes.

• “Full membership” model: each node is aware of the data hosted at its peers.


API: Two operations

• get(key)

• Determines the replica nodes associated with key.

• Returns the object, or a list of objects with conflicting versions, along with a context.

• put(key, context, object)

• Determines the replica nodes associated with key.

• Writes the object and its context.

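The whole client-facing surface is just these two calls plus the opaque context. A minimal sketch of how a client might drive this interface; the DynamoClient class and its single-process in-memory store are illustrative stand-ins, not Dynamo's actual client library:

```python
# Illustrative sketch of Dynamo's two-operation API.
# DynamoClient and its in-memory store are hypothetical stand-ins.
class DynamoClient:
    def __init__(self):
        self.store = {}  # key -> (list of versions, context)

    def get(self, key):
        """Return (versions, context); versions may hold several
        conflicting objects if replicas diverged."""
        return self.store.get(key, ([], None))

    def put(self, key, context, obj):
        """Write obj under key; context carries the vector-clock
        metadata from the preceding get(), telling the store which
        versions this write supersedes."""
        self.store[key] = ([obj], context)


client = DynamoClient()
client.put("cart:42", None, {"items": ["book"]})
versions, ctx = client.get("cart:42")
client.put("cart:42", ctx, {"items": ["book", "pen"]})  # read-modify-write
```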

Partitioning Algorithm: Consistent Hashing

• Each node is assigned a random position on the ring.

• Key k is hashed to a fixed circular space.

• Nodes are assigned by walking clockwise from the hash location.

• Example:

• Nodes A, B, C, D, E, F, G are assigned to the ring.

• Hash(k) falls between A and B.

• With 3 replicas, choose the next 3 nodes on the ring (i.e., B, C, D).


[Figure: consistent-hashing ring with nodes A through G; Hash(k) lands between A and B, and nodes are assigned by walking clockwise from that point.]

Consistent Hashing

• Key advantage: adding, deleting, or re-allocating nodes is cheap; it affects only the keys of immediate neighbor nodes.

• The choice of hash function affects locality and load distribution.

• “Virtual nodes”: each node is assigned multiple points on the ring.

• Higher-performance nodes are assigned more keyspace.

• If a node is unavailable, its load is shed to multiple alternative nodes.

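A minimal sketch of ring placement with virtual nodes, assuming MD5 as the key hash (as in the paper) and an arbitrary 8 virtual nodes per physical node:

```python
# Sketch of consistent hashing with virtual nodes; vnode count is arbitrary.
import bisect
import hashlib

def ring_hash(s: str) -> int:
    return int(hashlib.md5(s.encode()).hexdigest(), 16)

class Ring:
    def __init__(self, nodes, vnodes_per_node=8):
        # Each physical node claims several points ("virtual nodes") on the ring.
        self.points = sorted(
            (ring_hash(f"{node}#{i}"), node)
            for node in nodes
            for i in range(vnodes_per_node)
        )
        self.hashes = [h for h, _ in self.points]

    def preference_list(self, key, n_replicas=3):
        """Walk clockwise from Hash(key), collecting distinct physical nodes."""
        distinct = {node for _, node in self.points}
        n_replicas = min(n_replicas, len(distinct))
        i = bisect.bisect(self.hashes, ring_hash(key))
        found = []
        while len(found) < n_replicas:
            node = self.points[i % len(self.points)][1]
            if node not in found:  # skip extra vnodes of already-chosen nodes
                found.append(node)
            i += 1
        return found

ring = Ring(["A", "B", "C", "D", "E", "F", "G"])
print(ring.preference_list("shopping-cart:42"))  # three distinct nodes
```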

Replication

• Quorum: R + W > N guarantees consistency, because every read quorum overlaps every write quorum (see the check below).

• N: number of replicas.

• R: number of replicas to read from.

• W: number of replicas to write to.

• Latency is determined by the slowest of the R replicas for a read, and the slowest of the W replicas for a write.

• Read-optimized system: R=1, W=N.

• Write-optimized system: R=N, W=1.

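The guarantee is plain arithmetic: with R + W > N, every set of R replicas read must intersect every set of W replicas written, so at least one replica in any read quorum holds the latest acknowledged write. A tiny check (the paper cites (N, R, W) = (3, 2, 2) as a common configuration):

```python
# R + W > N means read and write quorums always overlap in >= 1 replica.
def quorums_overlap(n: int, r: int, w: int) -> bool:
    return r + w > n

print(quorums_overlap(3, 2, 2))  # True: common Dynamo configuration
print(quorums_overlap(3, 1, 3))  # True: read-optimized (R=1, W=N)
print(quorums_overlap(3, 3, 1))  # True: write-optimized (R=N, W=1)
print(quorums_overlap(3, 1, 1))  # False: a read may miss the latest write
```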

Data Versioning: Vector Clocks

• Eventual consistency:

• A put() call may return to the caller before the update has been applied at all replicas.

• A get() may then return “old” data.

• Vector clocks attempt to reconcile multiple data versions.

• A vector clock is a list of (node, counter) pairs.

• Each put() operation adds a leaf node that extends the previous vector-clock graph.

• Vector-clock data (all leaves) are part of the “context” of get() and put().

• Reconciliation of versions can be done by either the client or the server.

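A minimal sketch of the comparisons a vector clock enables; representing the clock as a dict of node-to-counter pairs is a convenience here, not the paper's wire format:

```python
# Vector clock sketch: dict of node -> counter (representation assumed).
def increment(clock, node):
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new

def descends(a, b):
    """True if version a supersedes (or equals) version b."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def concurrent(a, b):
    """Neither descends from the other: siblings needing reconciliation."""
    return not descends(a, b) and not descends(b, a)

d1 = increment({}, "Sx")   # {"Sx": 1}
d2 = increment(d1, "Sx")   # {"Sx": 2}, supersedes d1
d3 = increment(d1, "Sy")   # {"Sx": 1, "Sy": 1}, concurrent with d2
print(descends(d2, d1))    # True: d1 can be discarded
print(concurrent(d2, d3))  # True: get() returns both as siblings
```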

Routing Client Requests

• Definitions:

• “Preference list”: (key, node) routing information.

• “Coordinator” node: the first of the N replicas for a specific key.

• Client routing: two options.

• The client can hold routing data and route directly to the key’s coordinator node.

• Any node can receive the client’s request for any key, check the preference list, and forward the request to the key’s coordinator node.


Put() and Get() operations

• Put()

• The coordinator generates the vector clock for the new version and writes the data and vector clock locally.

• The coordinator sends the put() to the N highest-ranked reachable nodes in the preference list.

• If at least W-1 other nodes respond, the write is successful.

• Get()

• The coordinator requests all data versions from the N highest-ranked reachable nodes in the preference list.

• It waits for at least R responses.

• It returns the data to the client; this could be multiple versions.

• “Read repair”: divergent versions are reconciled, if possible, and the update is written back to the replicas.

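The coordinator's quorum logic might look roughly like this; the Replica class below fakes the network transport, and the names and failure handling are illustrative:

```python
# Sketch of coordinator-side quorum logic; Replica fakes the transport.
N, R, W = 3, 2, 2

class Replica:
    def __init__(self, name, alive=True):
        self.name, self.alive, self.data = name, alive, {}

    def write(self, key, value, clock):
        if not self.alive:
            return False
        self.data[key] = (value, clock)
        return True

    def read(self, key):
        return self.data.get(key) if self.alive else None

def coordinator_put(replicas, key, value, clock):
    # Write locally and at the other replicas; succeed once W acks arrive
    # (the coordinator's own write plus at least W-1 others).
    acks = sum(r.write(key, value, clock) for r in replicas[:N])
    return acks >= W

def coordinator_get(replicas, key):
    # Gather versions from the N highest-ranked reachable replicas;
    # fail if fewer than R respond. Divergent versions would be
    # reconciled here (read repair) or handed to the client.
    responses = [v for r in replicas[:N] if (v := r.read(key)) is not None]
    if len(responses) < R:
        raise TimeoutError("fewer than R replicas responded")
    return responses

nodes = [Replica("B"), Replica("C"), Replica("D", alive=False)]
print(coordinator_put(nodes, "k", "v1", {"B": 1}))  # True: 2 acks >= W
print(coordinator_get(nodes, "k"))                  # versions from B and C
```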

Handling Temporary Failures: Sloppy Quorum & Hinted Handoff

• Sloppy quorum (N, R, W)

• All operations are performed on the first N healthy nodes, which may not be the first N nodes on the ring.

• Hinted handoff

• If a node is unavailable, the put() request is sent to, and written at, another node. This data is called a “hinted replica”.

• The hinted replica’s context metadata records which node was the original target.

• As a background job, upon detecting that the original node has recovered, the hinted replica is written back to the original node.

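A sketch of the handoff mechanics; the Node class, the hint field, and the background scan are illustrative, not Dynamo's actual implementation:

```python
# Hinted handoff sketch: writes meant for a down node go to a fallback,
# tagged with a hint naming the intended owner (all names illustrative).
class Node:
    def __init__(self, name, alive=True):
        self.name, self.alive, self.data = name, alive, {}

def put_with_handoff(preference_list, fallback, key, value):
    for node in preference_list:
        if node.alive:
            node.data[key] = {"value": value, "hint": None}
        else:
            # Hinted replica: stored elsewhere, remembering the real owner.
            fallback.data[key] = {"value": value, "hint": node.name}

def handoff_scan(holder, all_nodes):
    # Background job: once the owner recovers, deliver the hinted replica
    # back to it and delete the local copy.
    by_name = {n.name: n for n in all_nodes}
    for key, rec in list(holder.data.items()):
        owner = rec["hint"]
        if owner is not None and by_name[owner].alive:
            by_name[owner].data[key] = {"value": rec["value"], "hint": None}
            del holder.data[key]

a, d = Node("A"), Node("D", alive=False)
put_with_handoff([d], fallback=a, key="k", value="v")
d.alive = True              # D recovers
handoff_scan(a, [a, d])
print(d.data)               # {'k': {'value': 'v', 'hint': None}}
```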

Handling Permanent Failures: Replica Synchronization

• An “anti-entropy” protocol is used to keep the replicas synchronized.

• Merkle trees:

• Leaves are hashes of the values of individual keys.

• Parent nodes are hashes of their respective children.

• Each node keeps a Merkle tree of its keys and values, so replicas can cheaply determine whether they hold the same data.

• If the root nodes of two trees are the same, the keys/values are the same.

• Otherwise, subtrees can be compared to find the set of data that is out of sync.

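A sketch of the root comparison, assuming SHA-1 leaf hashes and a simple binary tree (the paper specifies neither; both choices here are illustrative):

```python
# Merkle tree sketch: leaves hash key/value pairs, parents hash children.
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha1(data).digest()

def merkle_root(items):
    """items: the (key, value) pairs held by one replica for a key range."""
    level = [h(f"{k}={v}".encode()) for k, v in sorted(items)]
    if not level:
        return h(b"")
    while len(level) > 1:
        if len(level) % 2:  # duplicate the last hash if the level is odd
            level.append(level[-1])
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

replica1 = [("k1", "v1"), ("k2", "v2")]
replica2 = [("k2", "v2"), ("k1", "v1")]     # same data, different order
replica3 = [("k1", "v1"), ("k2", "stale")]  # k2 has diverged
print(merkle_root(replica1) == merkle_root(replica2))  # True: in sync
print(merkle_root(replica1) == merkle_root(replica3))  # False: resync needed
```

When roots differ, the protocol descends only the mismatching subtrees, so the out-of-sync range is found without transferring the whole data set.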

Node Membership and Failure Detection

• Membership

• A gossip-based protocol propagates node additions/removals around the ring. Each node contacts a random peer every second, and the two exchange data on partitioning and placement (i.e., preference lists).

• “Seeds” are a (configurable) subset of nodes known to all nodes. Eventually every node gossips with a seed node, which reduces the chance of a logically partitioned ring.

• Local failure detection

• To identify which nodes are unreachable, each node maintains its own list of available nodes; no globally consistent view is needed.

• If a node does not respond to a message, it is marked as unreachable.

• Periodically, the node retries the unreachable nodes.

• Additionally, explicit node join/leave commands are used.

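A sketch of the purely local bookkeeping this implies; the class, its retry interval, and the method names are all illustrative:

```python
# Local failure detection sketch: each node keeps only its own view of
# which peers are reachable; no global agreement is attempted.
import time

class LocalFailureDetector:
    def __init__(self, peers, retry_interval=10.0):
        self.peers = list(peers)
        self.retry_interval = retry_interval
        self.down_since = {}  # peer -> time it stopped responding

    def on_no_response(self, peer):
        # A message to this peer timed out: mark it locally unreachable.
        self.down_since.setdefault(peer, time.time())

    def on_response(self, peer):
        self.down_since.pop(peer, None)  # peer answered: usable again

    def reachable(self):
        return [p for p in self.peers if p not in self.down_since]

    def due_for_retry(self):
        # Periodically probe downed peers to see if they have recovered.
        now = time.time()
        return [p for p, t in self.down_since.items()
                if now - t >= self.retry_interval]

fd = LocalFailureDetector(["B", "C", "D"])
fd.on_no_response("D")
print(fd.reachable())  # ['B', 'C']: D is skipped until a retry succeeds
```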

Implementation

• Three main software components:

• Request coordination.

• SEDA (Staged Event-Driven Architecture): state machines and queues.

• The coordinator for a write is chosen to be the node that responded fastest to the previous read request.

• Membership and failure detection.

• Local persistence engine. Options include BDB, MySQL, and an in-memory DB.


Experiences & Lessons Learned

• Business logic specific reconciliation.

• Client application (not Dynamo) reconciles divergent data versions.

• Example: Shopping Cart. Client application merges cart contents.

• Timestamp-based reconciliation

• Dynamo uses “last write wins” reconciliation based on latest timestamp (client-

supplied(?)).

• Example: User session information.

• High-performance read engine

• R=1, W=N

• Example: Product catalog

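“Last write wins” reduces to taking the maximum over timestamps; this sketch assumes each version carries a client-supplied timestamp field, per the caveat above:

```python
# "Last write wins" sketch: keep the version with the newest timestamp.
def last_write_wins(versions):
    return max(versions, key=lambda v: v["timestamp"])

print(last_write_wins([
    {"value": "session-a", "timestamp": 1280000000},
    {"value": "session-b", "timestamp": 1280000100},
]))  # the later session-b write silently replaces session-a
```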

Example Latency Results

• Request rate follows a daily pattern.

• Writes are synced to disk, so they are slower than reads.

• Buffering writes reduces variability at the 99.9th percentile.


Uniform Load Distribution

• Goal: balance each node’s request load.

• Three strategies are compared:

1. T random tokens per node; partition by token value.

2. T random tokens per node; equal-sized partitions.

3. Q/S tokens per node; equal-sized partitions.

• The advantage of #2 and #3 is decoupling data partitioning from data placement: data is partitioned according to fixed ranges. A sketch of strategy 3 follows.
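A sketch of strategy 3's bookkeeping under illustrative numbers, assuming Q is divisible by S so each node gets exactly Q/S fixed-range partitions:

```python
# Strategy 3 sketch: Q equal-sized partitions, Q/S tokens per node
# (Q and S values are illustrative; Q assumed divisible by S).
import random

Q, S = 12, 4                # 12 fixed partitions over 4 nodes
partitions = list(range(Q))
random.shuffle(partitions)  # tokens are handed out randomly
assignment = {
    f"node{i}": sorted(partitions[i * (Q // S):(i + 1) * (Q // S)])
    for i in range(S)
}
print(assignment)           # each node owns exactly Q // S partitions
```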

Bigtable vs. Dynamo Comparison


• Data model: Bigtable is column-oriented; Dynamo is key-value.

• Storage layer: Bigtable uses a commit log, Memtable, and SSTables; Dynamo uses BDB plus a cache.

• Node types: Bigtable has a master node and tablet nodes; Dynamo’s nodes are symmetric.

• Consistency: Bigtable offers strong consistency; Dynamo’s is configurable.

• Transactions: Bigtable supports per-row transactions; Dynamo has none.

• Data locality: Bigtable keeps keys sorted, which is good for scans; Dynamo’s locality depends on the hash function.


Dynamo Family Tree

• Voldemort: LinkedIn

• Kai: Erlang implementation.

• Dynomite: Erlang implementation.

• Cassandra: “Open Source Bigtable + Dynamo”

• Hibari: Consistent hashing, key-value data model.
