Dynamo: Amazon’s Highly Available Key-value Store

Giuseppe DeCandia et al. [Amazon.com]

Jagrut Sharma (jagrutsh@usc.edu)
CSCI-572 (Prof. Chris Mattmann)
20-Jul-2010


Outline of Talk

• Motivation (1)
• Contribution (1)
• Context (1)
• Background (3)
• Related Work (2)
• System Architecture (7)
• Implementation (1)
• Experiences, Results & Lessons Learnt (4)
• Conclusion (1)
• Pros (1)
• Cons (1)
• Questions (1)


Motivation

• Tens of millions of customers
• Tens of thousands of servers
• Globally distributed data centers
• 24 * 7 * 365 operations
• Performance
• Reliability
• Efficiency
• Scalability
• Financial consequences
• Customer trust
• Data management


Contribution

• Evaluation of how different techniques can be combined to provide a highly-available system

• Demonstration of how an eventually consistent storage system (like Dynamo) can be used in a production environment with demanding applications

• Provision of tuning methods to meet requirements of production systems with very strict performance demands


Context

• Amazon’s e-commerce platform
  • Highly decentralized
  • Loosely coupled
  • Service-oriented architecture
  • Hundreds of services
  • Millions of components
  • Failure is a way of life
• Critical requirement
  • Always-available storage
• Storage techniques
  • S3 (Amazon Simple Storage Service)
  • Dynamo
    • Highly available and scalable distributed data store for Amazon’s platform
    • Provides a primary-key only interface for selected applications (e.g. shopping cart)
    • Combines multiple high-performance techniques & algorithms
    • Excellent performance in real-world scenarios


Background (1 of 3)

• E-commerce platform services: stateless & stateful
• Relational databases are overkill for stateful lookups by primary key
• Dynamo:
  • Simple key/value interface
  • Highly available
  • Efficient in resource usage
  • Scalable
• Each service that uses Dynamo runs its own Dynamo instances
• Dynamo’s target applications:
  • Store small objects (<1 MB)
  • Operate with weaker consistency if this gives high availability
  • Simple read-write to a data item uniquely identified by a key
  • No query operations span multiple data items
• Services use Dynamo to give priority to latency & throughput
• Amazon’s SLAs are expressed and measured at the 99.9th percentile of the distribution (in contrast to the common industry approach of using average, median and expected variance)


Background (2 of 3)

Assumptions About Dynamo

• Used only by Amazon’s internal services

• Operation environment is non-hostile

• There are no security-related requirements (e.g. authentication, authorization)

• Each service uses its distinct instance of Dynamo

• Dynamo’s initial design targets a scale of up to hundreds of storage hosts


Background (3 of 3)

Figure: SOA of Amazon’s platform

Dynamo Design Considerations

• How to balance replication and consistency?
  • Eventually consistent data store
• When to resolve update conflicts?
  • “Always writeable” data store (conflicts are resolved during reads)
• Who performs conflict resolution?
  • Either the data store or the application
• Incremental scalability at node level
• Symmetry among nodes
• Favors decentralization
• Capable of exploiting infrastructure heterogeneity


Related Work (1 of 2)

• Peer-to-Peer Systems
  • Tackle problems of data storage and distribution
  • Only support flat namespaces
  • Unstructured P2P: Freenet, Gnutella
    • Search query floods network
  • Structured P2P systems: Pastry, Chord, OceanStore, PAST
    • Employ globally consistent query routing protocol
    • Bounded number of hops
    • Maintain local routing tables
    • Provide rich storage services with conflict resolution
• Distributed File Systems and Databases
  • Support both flat & hierarchical namespaces
  • Ficus, Coda: high availability at expense of consistency
  • Farsite: high availability and scalability using replication
  • Google File System: master server, chunkservers
  • Bayou: distributed RDBMS, disconnected operations
  • Antiquity: wide-area distributed storage system
  • BigTable: distributed storage system for structured data


Related Work (2 of 2)

Dynamo Vs Other Systems

1. Targeted mainly at apps that need an “always writeable” data store

2. Built for an infrastructure within a single administrative domain where all nodes are assumed to be trusted

3. Applications using Dynamo do not require support for hierarchical namespaces or complex relational schema

4. Built for latency sensitive applications that require at least 99.9% of read and write operations to be performed within a few hundred milliseconds.

5. Avoids routing requests through multiple nodes. Hence, similar to a zero-hop Distributed Hash Table.


System Architecture (1 of 7)

List of Techniques Used by Dynamo & Their Advantages

Problem | Technique | Advantage
Partitioning | Consistent hashing | Incremental scalability
High availability for writes | Vector clocks with reconciliation during reads | Version size is decoupled from update rates
Handling temporary failures | Sloppy quorum and hinted handoff | Provides high availability and durability guarantee when some of the replicas are not available
Recovering from permanent failures | Anti-entropy using Merkle trees | Synchronizes divergent replicas in the background
Membership and failure detection | Gossip-based membership protocol and failure detection | Preserves symmetry and avoids having a centralized registry for storing membership and node liveness information


System Architecture (2 of 7)

System Interface

• get(key)
  • Locates the object replicas associated with key in the storage system
  • Returns a single object, or a list of objects with conflicting versions, along with a context
• put(key, context, object)
  • Determines where the replicas of the object should be placed based on the associated key
  • Writes replicas to disk
• context
  • Encodes system metadata about the object
  • Includes additional information (e.g. object version)
• key, object: each treated as an opaque array of bytes
• MD5 hash(key) -> 128-bit identifier, used to determine the storage nodes that are responsible for serving the key
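The key-to-identifier step above can be sketched in a few lines of Python. This is only an illustration: the function name `key_to_identifier` and the sample key are hypothetical, and the slides do not say how the digest bytes are interpreted (big-endian is assumed here).

```python
import hashlib

RING_BITS = 128  # an MD5 digest is 16 bytes, i.e. 128 bits


def key_to_identifier(key: bytes) -> int:
    """Map an opaque key to a 128-bit ring identifier using MD5.

    Assumption: the digest is read as a big-endian unsigned integer.
    """
    return int.from_bytes(hashlib.md5(key).digest(), "big")


# Hypothetical key; any byte string works because keys are opaque.
identifier = key_to_identifier(b"cart:alice")
print(f"{identifier:032x}")
```

The identifier is what gets placed on the hash ring; the key itself is never interpreted by the store.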


System Architecture (3 of 7)

Partitioning Algorithm

• Provides a mechanism to dynamically partition the data over the set of nodes (i.e. storage hosts)
• Uses a variant of consistent hashing (the output range of a hash function is treated as a fixed circular space or ‘ring’; the largest hash value wraps around to the smallest hash value)
  • Advantage: departure or arrival of a node only affects its immediate neighbors
  • Limitation 1 of basic consistent hashing: leads to non-uniform data and load distribution
  • Limitation 2 of basic consistent hashing: oblivious to heterogeneity in the performance of nodes
• Dynamo’s variant: a single node is mapped to multiple points in the ring, i.e. virtual nodes
• Advantages of virtual nodes:
  • Graceful handling of failure of a node
  • Easy accommodation of a new node
  • Heterogeneity in physical infrastructure can be exploited
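A minimal sketch of a hash ring with virtual nodes follows, to make the points above concrete. It is not Dynamo's implementation: the `Ring` class, the `"node#i"` token naming, and the token counts are assumptions chosen for illustration.

```python
import bisect
import hashlib


def ring_hash(data: str) -> int:
    """Position on the ring: MD5 digest read as a 128-bit integer."""
    return int.from_bytes(hashlib.md5(data.encode()).digest(), "big")


class Ring:
    """Toy consistent-hashing ring; each physical node owns several tokens."""

    def __init__(self):
        self._tokens = []  # sorted ring positions
        self._owner = {}   # ring position -> physical node name

    def add_node(self, node: str, vnodes: int) -> None:
        # A more capable host can be given a larger vnodes count.
        for i in range(vnodes):
            token = ring_hash(f"{node}#{i}")
            bisect.insort(self._tokens, token)
            self._owner[token] = node

    def remove_node(self, node: str) -> None:
        self._tokens = [t for t in self._tokens if self._owner[t] != node]
        self._owner = {t: n for t, n in self._owner.items() if n != node}

    def lookup(self, key: str) -> str:
        """First token clockwise from the key's position owns the key."""
        if not self._tokens:
            raise LookupError("ring is empty")
        idx = bisect.bisect_right(self._tokens, ring_hash(key))
        return self._owner[self._tokens[idx % len(self._tokens)]]


ring = Ring()
for name in ("A", "B", "C"):
    ring.add_node(name, vnodes=8)

keys = [f"key{i}" for i in range(200)]
before = {k: ring.lookup(k) for k in keys}
ring.add_node("D", vnodes=8)
after = {k: ring.lookup(k) for k in keys}
moved = [k for k in keys if before[k] != after[k]]
```

Every key in `moved` now belongs to the new node `D`; keys never shuffle between the existing nodes, which is the incremental-scalability property the slide refers to.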


System Architecture (4 of 7)


Replication

• Each data item is replicated at N hosts
• N is configured per instance
• Each node is responsible for the region of the ring between it and its Nth predecessor
• Preference list: list of nodes responsible for storing a particular key
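One way to picture the preference list is as a clockwise walk around the ring that collects the first N distinct physical nodes (the Dynamo paper notes that ring positions are skipped so the list contains only distinct physical nodes). The sketch below assumes a tiny hand-written ring with small integer positions; the function name and data layout are hypothetical.

```python
import bisect


def preference_list(ring, key_pos, n):
    """Return the first n distinct physical nodes clockwise from key_pos.

    ring: list of (position, physical_node) tokens sorted by position.
    """
    positions = [pos for pos, _ in ring]
    start = bisect.bisect_right(positions, key_pos)
    nodes = []
    for step in range(len(ring)):
        node = ring[(start + step) % len(ring)][1]
        if node not in nodes:  # skip extra virtual nodes of the same host
            nodes.append(node)
        if len(nodes) == n:
            break
    return nodes


# Node A owns two virtual nodes (positions 10 and 30).
toy_ring = [(10, "A"), (20, "B"), (30, "A"), (40, "C"), (50, "D")]
print(preference_list(toy_ring, key_pos=15, n=3))  # ['B', 'A', 'C']
print(preference_list(toy_ring, key_pos=55, n=3))  # ['A', 'B', 'C']
```

The second call shows the wrap-around: a key past the largest token is served starting from the smallest one.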

Data Versioning

• Eventual consistency: allows updates to be propagated to all replicas asynchronously
  • put() may return to the caller before the update has been applied at all replicas
  • get() may return an object that does not have the latest updates
• Multiple versions of an object can be present in the system at the same time
• Syntactic reconciliation: performed by the system
• Semantic reconciliation: performed by the client
• Vector clock: a list of (node, counter) pairs, used for capturing causality between different versions of the same object. One vector clock per version per object.
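The causality check that vector clocks enable can be sketched as follows. Clocks are modeled here as plain dicts from node name to counter, which is an assumption for illustration; the node names Sx, Sy, Sz follow the example in the Dynamo paper.

```python
def increment(clock, node):
    """Return a copy of clock with node's counter advanced by one."""
    new = dict(clock)
    new[node] = new.get(node, 0) + 1
    return new


def descends(a, b):
    """True if version a has seen every update recorded in version b."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def compare(a, b):
    if descends(a, b) and descends(b, a):
        return "equal"
    if descends(a, b):
        return "a supersedes b"  # syntactic reconciliation: keep a
    if descends(b, a):
        return "b supersedes a"  # syntactic reconciliation: keep b
    return "concurrent"          # needs semantic reconciliation by the client


def merge(a, b):
    """Element-wise maximum, used when a client writes a reconciled version."""
    return {node: max(a.get(node, 0), b.get(node, 0)) for node in a.keys() | b.keys()}


d1 = increment({}, "Sx")   # first write handled by Sx
d2 = increment(d1, "Sx")   # overwrite handled by Sx
d3 = increment(d2, "Sy")   # update handled by Sy
d4 = increment(d2, "Sz")   # parallel update handled by Sz
d5 = increment(merge(d3, d4), "Sx")  # client reconciles, Sx coordinates

print(compare(d2, d1))  # a supersedes b
print(compare(d3, d4))  # concurrent
```

Only the "concurrent" case surfaces multiple versions to the application; in the other cases the system can discard the older version on its own.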


System Architecture (5 of 7)

Execution of get() and put() Operations

• Any storage node in Dynamo is eligible to receive client get() and put() operations for any key

• Client can select a node using:
  • generic load balancer
  • partition-aware client library
• Coordinator:
  • node handling a read or write operation
  • typically, the first among the top N nodes in the preference list
• Consistency protocol used to maintain consistency among replicas. Two key configurable values are:
  • R: min. no. of nodes that must participate in a successful read operation
  • W: min. no. of nodes that must participate in a successful write operation
  • R + W > N is preferable (read and write sets then always overlap, giving a quorum-like system)
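Why R + W > N matters can be checked by brute force: the condition holds exactly when every possible set of R readers shares at least one replica with every possible set of W writers. The helper name below is hypothetical and the check is purely illustrative.

```python
from itertools import combinations


def read_always_sees_write(n, r, w):
    """True if every R-subset of N replicas intersects every W-subset."""
    replicas = range(n)
    return all(
        set(read_set) & set(write_set)
        for read_set in combinations(replicas, r)
        for write_set in combinations(replicas, w)
    )


print(read_always_sees_write(3, 2, 2))  # True:  2 + 2 > 3
print(read_always_sees_write(3, 1, 1))  # False: 1 + 1 <= 3
```

Lowering R or W below this bound trades the overlap guarantee for lower latency, since fewer replicas have to answer before the operation returns.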


System Architecture (6 of 7)

Handling Failures: Hinted Handoff

• Mechanism to ensure that read and write operations do not fail because of temporary node or network failures.

• All read and write operations are performed on the first N healthy nodes from the preference list, which may NOT always be the first N nodes encountered while walking the consistent hashing ring.

• Each object is replicated across multiple data centers, which are connected through high-speed network links.
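The "first N healthy nodes" rule can be sketched as below. The function name, the `(target, hinted_for)` tuple shape, and the node names are assumptions for illustration; the scenario mirrors the Dynamo paper's example in which a replica meant for an unreachable node A is sent to D with a hint.

```python
def choose_write_targets(ring_walk, healthy, n):
    """Pick N healthy targets for a write, attaching hints for stand-ins.

    ring_walk: distinct physical nodes in clockwise order from the key.
    healthy:   set of nodes currently reachable.
    Returns a list of (target, hinted_for); hinted_for is None for a node
    that is one of the intended top-N replicas.
    """
    intended = ring_walk[:n]
    targets = [(node, None) for node in intended if node in healthy]
    unavailable = [node for node in intended if node not in healthy]
    stand_ins = (node for node in ring_walk[n:] if node in healthy)
    for missing in unavailable:
        substitute = next(stand_ins, None)
        if substitute is None:
            break  # not enough healthy nodes left on the ring
        # The hint records the intended owner so the replica can be
        # handed back once that node recovers.
        targets.append((substitute, missing))
    return targets


walk = ["A", "B", "C", "D", "E"]
print(choose_write_targets(walk, healthy={"B", "C", "D", "E"}, n=3))
# [('B', None), ('C', None), ('D', 'A')]
```

When every intended replica is healthy the result is simply the first N nodes with no hints attached.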

Handling Permanent Failures: Replica Synchronization

• Dynamo implements an anti-entropy protocol to keep replicas synchronized. Uses Merkle trees.

• Merkle tree: A hash tree where leaves are hashes of the values of individual keys.
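A small sketch of how a Merkle tree narrows down which keys differ between two replicas. Several details are assumptions made for illustration: SHA-256 as the hash, duplicating the last node on odd-sized levels, and both replicas holding the same number of values for the same sorted keys.

```python
import hashlib


def _h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()


def merkle_levels(values):
    """Build tree levels, leaves first; the last level holds the root.

    values: non-empty list of byte strings, one per key, in key order.
    """
    level = [_h(v) for v in values]
    levels = [level]
    while len(level) > 1:
        padded = level + [level[-1]] if len(level) % 2 else level
        level = [_h(padded[i] + padded[i + 1]) for i in range(0, len(padded), 2)]
        levels.append(level)
    return levels


def differing_leaves(values_a, values_b):
    """Descend from the root, following only subtrees whose hashes differ."""
    tree_a, tree_b = merkle_levels(values_a), merkle_levels(values_b)
    suspects = [0]
    for depth in range(len(tree_a) - 1, -1, -1):
        suspects = [i for i in suspects if tree_a[depth][i] != tree_b[depth][i]]
        if depth == 0 or not suspects:
            return suspects
        width = len(tree_a[depth - 1])
        suspects = [c for i in suspects for c in (2 * i, 2 * i + 1) if c < width]
    return suspects


replica_a = [b"v0", b"v1", b"v2", b"v3", b"v4"]
replica_b = [b"v0", b"v1", b"v2", b"stale", b"v4"]
print(differing_leaves(replica_a, replica_b))  # [3]
```

If the two roots match, the descent stops immediately and no data needs to be exchanged, which is what keeps background synchronization cheap.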


System Architecture (7 of 7)

Membership and Failure Detection

• Explicit mechanism available to initiate the addition and removal of nodes from a Dynamo ring.

• To prevent logical partitions, some Dynamo nodes play the role of seed nodes.

• Seeds: Nodes that are discovered by an external mechanism and known to all nodes.

• Failure detection is done in a purely local manner: a node treats a peer as failed if the peer does not respond to its messages.

• Gossip-based distributed failure detection and membership protocol
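The gossip idea can be sketched as pairs of nodes repeatedly merging their membership views. This is a simplified illustration, not Dynamo's protocol: the `(version, status)` entries, the "higher version wins" rule, and the function names are all assumptions.

```python
import random


def reconcile(view_a, view_b):
    """Merge two membership views, keeping the entry with the higher version."""
    merged = dict(view_a)
    for node, (version, status) in view_b.items():
        if node not in merged or merged[node][0] < version:
            merged[node] = (version, status)
    return merged


def gossip_round(views, rng):
    """Each node contacts one random peer and both adopt the merged view.

    Requires at least two nodes in views.
    """
    names = sorted(views)
    for name in names:
        peer = rng.choice([other for other in names if other != name])
        merged = reconcile(views[name], views[peer])
        views[name] = dict(merged)
        views[peer] = dict(merged)


# Two nodes that initially know only about themselves.
views = {
    "A": {"A": (1, "joined")},
    "B": {"B": (1, "joined")},
}
gossip_round(views, random.Random(0))
print(views["A"] == views["B"])  # True
```

With more nodes, several rounds are generally needed before all views agree; seed nodes help because every node eventually reconciles with them.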


Implementation

Storage Node

• Request Coordination
  • Built on top of an event-driven messaging substrate
  • Uses Java NIO
  • Coordinator executes client read & write requests
  • State machines are created on nodes serving requests
  • Each state machine instance handles exactly one client request
  • State machine contains the entire process and failure-handling logic
• Membership & Failure Detection
• Local Persistence Engine
  • Pluggable storage engines:
    • Berkeley Database (BDB) Transactional Data Store
    • BDB Java Edition
    • MySQL
    • In-memory buffer with persistent backing store
  • Engine chosen based on application’s object size distribution


Experiences, Results & Lessons Learnt (1 of 4)

• Main Dynamo Usage Patterns

1. Business-logic-specific reconciliation
  • E.g. merging different versions of a customer’s shopping cart
2. Timestamp-based reconciliation
  • E.g. maintaining a customer’s session information
3. High-performance read engine
  • E.g. maintaining product catalog and promotional items

• Client applications can tune N, R and W to achieve their desired levels of performance, availability and durability:
  • N: no. of hosts a data item is replicated at
  • R: min. no. of nodes that must participate in a successful read operation
  • W: min. no. of nodes that must participate in a successful write operation
  • Commonly used configuration: (N, R, W) = (3, 2, 2)

• Dynamo exposes data consistency & reconciliation logic to developers

• Dynamo adopts a full membership model – each node is aware of the data hosted by its peers


Experiences, Results & Lessons Learnt (2 of 4)

• Typical SLA of service using Dynamo: 99.9% of the read and write requests execute within 300 ms

• Balancing Performance and Durability

Figure: Average & 99.9th percentile latencies of Dynamo’s read and write operations during a period of 30 days

Figure: Comparison of 99.9th percentile latencies for buffered vs. non-buffered writes over 24 hours


Experiences, Results & Lessons Learnt (3 of 4)

• Ensuring Uniform Load Distribution
  • Dynamo uses consistent hashing to partition its key space across its replicas and to ensure uniform load distribution.
  • Node “in-balance”: request load for the node deviates from the average load by less than a certain threshold. Otherwise, the node is “out-of-balance”.
  • Imbalance ratio = Nodes out-of-balance / Total nodes
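As a worked example of the imbalance ratio, the sketch below expresses the threshold as a fraction of the average load. That representation, the function name, and the sample loads are assumptions for illustration; the slide only says "a certain threshold".

```python
def imbalance_ratio(loads, threshold):
    """Fraction of nodes whose load deviates from the average by at least
    threshold * average (threshold given as a fraction, e.g. 0.15)."""
    average = sum(loads) / len(loads)
    out_of_balance = sum(
        1 for load in loads if abs(load - average) >= threshold * average
    )
    return out_of_balance / len(loads)


# Hypothetical per-node request counts; the average is 100.
request_loads = [100, 110, 90, 160, 40]
print(imbalance_ratio(request_loads, threshold=0.15))  # 0.4
```

Here two of the five nodes (loads 160 and 40) deviate from the average by more than 15%, so the ratio is 2 / 5.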

Figure: Node imbalance & workload

Figure: Comparison of load distribution efficiency of different strategies


Experiences, Results & Lessons Learnt (4 of 4)

• Three strategies for load distribution
  1. T random tokens per node and partition by token value
  2. T random tokens per node and equal-sized partitions
  3. Q/S tokens per node, equal-sized partitions (S = number of nodes, Q = number of partitions)
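Strategy 3 can be illustrated with a toy assignment in which the ring is cut into Q equal partitions and each of the S nodes receives exactly Q/S of them. The round-robin placement below is an assumption chosen for simplicity, not the placement Dynamo uses, and the function name is hypothetical.

```python
def assign_equal_partitions(nodes, q):
    """Give each of the S nodes exactly Q/S of the Q equal-sized partitions.

    Assumption: partitions are dealt out round-robin by index.
    """
    s = len(nodes)
    if s == 0 or q <= 0 or q % s != 0:
        raise ValueError("Q must be a positive multiple of the number of nodes S")
    return {node: list(range(i, q, s)) for i, node in enumerate(nodes)}


assignment = assign_equal_partitions(["A", "B", "C", "D"], q=12)
print(assignment["A"])  # [0, 4, 8]
```

Because partition boundaries are fixed, adding or removing a node only changes which node owns a partition, never where a partition begins or ends.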

• Divergent versions of a data item (rarely) arise in two scenarios:
  1. System is facing failure scenarios (node/data center/network)
  2. Large number of concurrent writers to a single data item

• Server-driven coordination: client requests are uniformly assigned to nodes in the ring by a load balancer.

• Client-driven coordination: client applications use a library to perform request coordination locally.

Coordination | 99.9th percentile read latency (ms) | 99.9th percentile write latency (ms) | Average read latency (ms) | Average write latency (ms)
Server-driven | 68.9 | 68.5 | 3.9 | 4.02
Client-driven | 30.4 | 30.4 | 1.55 | 1.9


Conclusion

Dynamo:
• Is a highly available and scalable data store
• Is used for storing state of a number of core services of Amazon.com’s e-commerce platform
• Has provided desired levels of availability and performance and has been successful in handling:
  • Server failures
  • Data center failures
  • Network partitions
• Is incrementally scalable
• Sacrifices consistency under certain failure scenarios
• Extensively uses object versioning and application-assisted conflict resolution
• Allows service owners to:
  • scale up and down based on their current request load
  • customize their storage system to meet desired performance, durability and consistency SLAs by allowing tuning of N, R, W parameters

• Decentralized techniques can be combined to provide a single highly-available system.


Pros

• Excellent description of core distributed systems techniques used in Dynamo:
  • partitioning, replication, versioning, membership, failure handling, scaling
• Liberal use of diagrams, charts and tables to explain concepts
• Real-world examples have been provided to enable the user to understand and appreciate the theoretical concepts
• Theoretical and implementation-level differences have been clearly explained
• Exhaustive list of references for the interested researcher
• Well-written paper with logical transition from one topic to the next


Cons

• Little description of supporting techniques used in Dynamo for:
  • state transfer, concurrency & job scheduling, request marshalling, request routing, system monitoring and alarming

• Certain theoretically possible problems have not been investigated in detail, since they have not been encountered in production systems.

• Sophisticated comparison with existing systems has not been provided.

• To protect Amazon.com’s business interests, certain parts of the system have either not been fully described or have been described only at a very high level.

• Future work and possible extensions have not been mentioned clearly.


Questions
