Massively Parallel/Distributed Data Storage Systems


Page 1: Massively Parallel/Distributed Data Storage Systems

+

Massively Parallel/Distributed Data Storage Systems

S. Sudarshan, IIT Bombay

Derived from an earlier talk by S. Sudarshan, presented at the MSRI Summer School on Distributed Systems, May/June 2012

Page 2: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

2

Why Distributed Data Storage a.k.a. Cloud Storage

Explosion of social media sites (Facebook, Twitter) with large data needs
Explosion of storage needs in large web sites such as Google, Yahoo
  100's of millions of users
  Many applications with petabytes of data
  Much of the data is not files
Very high demands on
  Scalability
  Availability

Page 3: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

3

Why Distributed Data Storage a.k.a. Cloud Storage

Step 0 (prehistory): Distributed database systems with tens of nodes
Step 1: Distributed file systems with 1000s of nodes
  Millions of large objects (100's of megabytes)
Step 2: Distributed data storage systems with 1000s of nodes
  100s of billions of smaller (kilobyte to megabyte) objects
Step 3 (recent and future work): Distributed database systems with 1000s of nodes

Page 4: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

4

Examples of Types of Big Data

Large objects
  Video, large images, web logs
  Typically write once, read many times, no updates
  Distributed file systems

Transactional data from Web-based applications
  E.g. social network (Facebook/Twitter) updates, friend lists, likes, …, email (at least metadata)
  Billions to trillions of objects; distributed data storage systems

Indices
  E.g. Web search indices with an inverted list per word
  In earlier days, no updates; rebuilt periodically
  Today: frequent updates (e.g. Google Percolator)

Page 5: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

5

Why not use Parallel Databases?

Parallel databases have been around since the 1980s
  Most parallel databases were designed for decision support, not OLTP
  Were designed for scales of 10s to 100s of processors
  Single machine failures are common and are handled
    But usually require jobs to be rerun in case of failure during execution
  Do not consider network partitions and data distribution

Demands on distributed data storage systems
  Scalability to thousands to tens of thousands of nodes
  Geographic distribution for lower latency and higher availability

Page 6: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

6

Basics: Parallel/Distributed Data Storage

Replication
  System maintains multiple copies of data, stored in different nodes/sites, for faster retrieval and fault tolerance.

Data partitioning
  Relation is divided into several partitions stored in distinct nodes/sites

Replication and partitioning are combined
  Relation is divided into multiple partitions; the system maintains several identical replicas of each such partition.

Page 7: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

7

Basics: Data Replication

Advantages of Replication
  Availability: failure of a site containing relation r does not result in unavailability of r if replicas exist.
  Parallelism: queries on r may be processed by several nodes in parallel.
  Reduced data transfer: relation r is available locally at each site containing a replica of r.

Cost of Replication
  Increased cost of updates: each replica of relation r must be updated.
  Special concurrency control and atomic commit mechanisms are needed to ensure replicas stay in sync

Page 8: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

8

Basics: Data Transparency

Data transparency: Degree to which system user may remain unaware of the details of how and where the data items are stored in a distributed system

Consider transparency issues in relation to:
  Fragmentation transparency
  Replication transparency
  Location transparency

Page 9: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

9

Basics: Naming of Data Items

Naming of items: desiderata
  1. Every data item must have a system-wide unique name.
  2. It should be possible to find the location of data items efficiently.
  3. It should be possible to change the location of data items transparently.
  4. Data item creation/naming should not be centralized.

Implementations:
  Global directory
    Used in file systems
  Partition the name space
    Each partition under control of one node
    Used for data storage systems

Page 10: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

10

Build-it-Yourself Parallel Data Storage: a.k.a. Sharding

"Sharding"
  Divide data amongst many cheap databases (MySQL/PostgreSQL)
  Manage parallel access in the application
    Partition tables map keys to nodes
    Application decides where to route storage or lookup requests (see the sketch below)
  Scales well for both reads and writes
Limitations
  Not transparent: application needs to be partition-aware AND application needs to deal with replication
  (Not a true parallel database, since parallel queries and transactions spanning nodes are not supported)
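As a rough illustration of application-level sharding (not from the original slides), the sketch below hashes a key to one of several database connections. The shard list, the users table, and the column names are all hypothetical; any already-open DB-API connections (e.g. to MySQL or PostgreSQL instances) could play this role.

    # Sketch of application-level sharding over a list of open DB connections.
    import hashlib

    class ShardRouter:
        def __init__(self, connections):
            self.connections = connections          # hypothetical DB-API connections, one per shard

        def shard_for(self, key):
            # Hash the key and map it to one of the shards.
            h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
            return self.connections[h % len(self.connections)]

        def get_user(self, user_id):
            conn = self.shard_for(user_id)
            cur = conn.cursor()
            cur.execute("SELECT name FROM users WHERE user_id = %s", (user_id,))
            return cur.fetchone()

        def put_user(self, user_id, name):
            conn = self.shard_for(user_id)
            cur = conn.cursor()
            cur.execute("INSERT INTO users (user_id, name) VALUES (%s, %s)", (user_id, name))
            conn.commit()

Replication and queries that span shards are left entirely to the application, which is exactly the limitation the slide points out.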

Page 11: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

11

Parallel/Distributed Key-Value Data Stores

Distributed key-value data storage systems allow key-value pairs to be stored (and retrieved on key) in a massively parallel system
  E.g. Google BigTable, Yahoo! Sherpa/PNUTS, Amazon Dynamo, ...
Partitioning, replication, high availability etc. are mostly transparent to the application
  They are the responsibility of the data storage system
These are NOT full-fledged database systems
  A.k.a. NoSQL systems

Focus of this talk

Page 12: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

12

Typical Data Storage Access API

Basic API access:
  get(key) -- Extract the value given a key
  put(key, value) -- Create or update the value given its key
  delete(key) -- Remove the key and its associated value
  execute(key, operation, parameters) -- Invoke an operation on the value (given its key), which is a special data structure (e.g. List, Set, Map, etc.)

Extensions add version numbering, etc. (A minimal sketch of this interface follows below.)
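A minimal, single-node sketch of this get/put/delete/execute interface (purely illustrative; real systems partition and replicate behind the same API, and the class and key names here are made up):

    # Toy in-memory key-value store exposing the basic API from the slide.
    class KeyValueStore:
        def __init__(self):
            self.data = {}

        def get(self, key):
            # Extract the value given a key (None if absent).
            return self.data.get(key)

        def put(self, key, value):
            # Create or update the value for the key.
            self.data[key] = value

        def delete(self, key):
            # Remove the key and its associated value.
            self.data.pop(key, None)

        def execute(self, key, operation, *parameters):
            # Invoke an operation on a structured value (here: a list by default).
            value = self.data.setdefault(key, [])
            return getattr(value, operation)(*parameters)

    store = KeyValueStore()
    store.put("user:42", {"name": "Alice"})
    store.execute("cart:42", "append", "item-17")   # treats the value under cart:42 as a list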

Page 13: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

13

What is NoSQL?

Stands for No-SQL or Not Only SQL??
Class of non-relational data storage systems
  E.g. BigTable, Dynamo, PNUTS/Sherpa, ...
Synonymous with distributed data storage systems
We don't like the term NoSQL

Page 14: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

14

Data Storage Systems vs. Databases

Distributed data storage systems do not support many relational features
  No join operations (except within a partition)
  No referential integrity constraints across partitions
  No ACID transactions (across nodes)
  No support for SQL or query optimization

But they usually do provide flexible schema and other features
  Structured objects, e.g. using JSON
  Multiple versions of data items

Page 15: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

15

Querying Static Big Data

Large data sets broken into multiple files

Static append-only data
  E.g. new files added to the dataset each day
  No updates to existing data

Map-reduce framework for massively parallel querying

Not the focus of this talk. We focus on:
  Transactional data which is subject to updates
  Very large numbers of transactions, each of which reads/writes small amounts of data
  I.e. online transaction processing (OLTP) workloads

Page 16: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

Talk Outline

Background: Distributed Transactions
  Concurrency Control / Replication Consistency Schemes
Distributed File Systems
Parallel/Distributed Data Storage Systems
  Basics
  Architecture
  Bigtable (Google)
  PNUTS/Sherpa (Yahoo)
  Megastore (Google)
CAP Theorem: availability vs. consistency
  Basics
  Dynamo (Amazon)

Page 17: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

Background: Distributed Transactions

Slides in this section are from Database System Concepts, 6th Edition, by Silberschatz, Korth and Sudarshan, McGraw Hill, 2010

Page 18: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

19

Transaction System Architecture

Page 19: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

20

System Failure Modes

Failures unique to distributed systems:
  Failure of a site
  Loss of messages
    Handled by network transmission control protocols such as TCP/IP
  Failure of a communication link
    Handled by network protocols, by routing messages via alternative links
  Network partition
    A network is said to be partitioned when it has been split into two or more subsystems that lack any connection between them
    Note: a subsystem may consist of a single node

Network partitioning and site failures are generally indistinguishable.

Page 20: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

21

Commit Protocols

Commit protocols are used to ensure atomicity across sites
  A transaction which executes at multiple sites must either be committed at all the sites, or aborted at all the sites.
  It is not acceptable to have a transaction committed at one site and aborted at another.

The two-phase commit (2PC) protocol is widely used.

The three-phase commit (3PC) protocol is more complicated and more expensive, but avoids some drawbacks of the two-phase commit protocol.

Page 21: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

22

Two Phase Commit Protocol (2PC)

Execution of the protocol is initiated by the coordinator after the last step of the transaction has been reached.
The protocol involves all the local sites at which the transaction executed.
Let T be a transaction initiated at site Si, and let the transaction coordinator at Si be Ci.

Page 22: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

23

Phase 1: Obtaining a Decision

Coordinator asks all participants to prepare to commit transaction Ti.
  Ci adds the record <prepare T> to the log and forces the log to stable storage
  Sends prepare T messages to all sites at which T executed

Upon receiving the message, the transaction manager at a site determines if it can commit the transaction
  If not, it adds a record <no T> to the log and sends an abort T message to Ci
  If the transaction can be committed, then it:
    adds the record <ready T> to the log
    forces all records for T to stable storage
    sends a ready T message to Ci

Page 23: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

24

Phase 2: Recording the Decision

T can be committed if Ci received a ready T message from all the participating sites; otherwise T must be aborted.
The coordinator adds a decision record, <commit T> or <abort T>, to the log and forces the record onto stable storage. Once the record reaches stable storage it is irrevocable (even if failures occur).
The coordinator sends a message to each participant informing it of the decision (commit or abort).
Participants take appropriate action locally. (A simplified coordinator sketch follows below.)
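To make the two phases concrete, here is a highly simplified coordinator sketch (not from the slides): message transport, timeouts, and recovery are omitted, and `participants` and `log` are hypothetical helper objects.

    # Simplified 2PC coordinator. `participants` is a list of objects with
    # prepare(txn) and decide(txn, decision) methods; `log` is a write-ahead
    # log whose append() forces the record to stable storage.
    def two_phase_commit(txn, participants, log):
        # Phase 1: obtain a decision.
        log.append(("prepare", txn))                           # <prepare T>
        votes = [p.prepare(txn) for p in participants]         # each returns "ready" or "no"

        # Phase 2: record the decision, then propagate it.
        decision = "commit" if all(v == "ready" for v in votes) else "abort"
        log.append((decision, txn))                            # <commit T>/<abort T>: the irrevocable point
        for p in participants:
            p.decide(txn, decision)                            # participants apply commit/abort locally
        return decision

The blocking problem discussed on the next slide arises when a participant has voted "ready" but cannot reach the coordinator to learn the decision.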

Page 24: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

25

Three Phase Commit (3PC)

Blocking problem in 2PC: if the coordinator is disconnected from a participant, a participant which had sent a ready message may be in a blocked state
  It cannot figure out whether to commit or abort

Partial solution: three phase commit
  Phase 1: Obtaining a preliminary decision: identical to 2PC Phase 1.
    Every site is ready to commit if instructed to do so
  Phase 2 of 2PC is split into two phases, Phase 2 and Phase 3 of 3PC
    In Phase 2 the coordinator makes a decision as in 2PC (called the pre-commit decision) and records it at multiple (at least K) sites
    In Phase 3, the coordinator sends the commit/abort message to all participating sites
  Under 3PC, knowledge of the pre-commit decision can be used to commit despite coordinator failure
    Avoids the blocking problem as long as fewer than K sites fail
  Drawbacks: higher overhead, and its assumptions may not be satisfied in practice

Page 25: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

26

Distributed Transactions via Persistent Messaging

The notion of a single transaction spanning multiple sites is inappropriate for many applications
  E.g. a transaction crossing an organizational boundary
  Latency of waiting for commit from a remote site

Alternative models carry out transactions by sending messages
  Code to handle messages must be carefully designed to ensure atomicity and durability properties for updates
  Isolation cannot be guaranteed, in that intermediate stages are visible, but code must ensure no inconsistent states result due to concurrency

Persistent messaging systems are systems that provide transactional properties to messages
  Messages are guaranteed to be delivered exactly once
  Will discuss implementation techniques later

Page 26: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

27

Persistent Messaging

Example: funds transfer between two banks
  Two phase commit would have the potential to block updates on the accounts involved in the funds transfer
  Alternative solution:
    Debit money from the source account and send a message to the other site
    The other site receives the message and credits the destination account
  Messaging has long been used for distributed transactions (even before computers were invented!)

Atomicity issue (an outbox-style sketch follows below)
  Once the transaction sending a message is committed, the message must be guaranteed to be delivered
    Guaranteed as long as the destination site is up and reachable; code to handle undeliverable messages must also be available, e.g. credit the money back to the source account
  If the sending transaction aborts, the message must not be sent
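One common way to obtain this atomicity (a sketch, not something prescribed by the slides) is an "outbox": the debit and the outgoing message are written in the same local database transaction, and a separate delivery process re-sends logged messages until they are acknowledged. The table names, columns, and connection object below are hypothetical.

    # Debit the source account and enqueue the transfer message atomically,
    # inside one local database transaction (hypothetical schema).
    def transfer_out(conn, source_acct, dest_site, dest_acct, amount):
        cur = conn.cursor()
        cur.execute("UPDATE accounts SET balance = balance - %s WHERE acct_id = %s",
                    (amount, source_acct))
        cur.execute("INSERT INTO message_outbox (dest_site, payload) VALUES (%s, %s)",
                    (dest_site, f"credit {dest_acct} {amount}"))
        conn.commit()   # either both the debit and the message are durable, or neither

    # A separate delivery loop re-sends each outbox message until it is acknowledged,
    # giving at-least-once persistent delivery; the receiving site must deduplicate.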

Page 27: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

28

Error Conditions with Persistent Messaging

Code to handle messages has to take care of a variety of failure situations (even assuming guaranteed message delivery)
  E.g. if the destination account does not exist, a failure message must be sent back to the source site
  When a failure message is received from the destination site, or the destination site itself does not exist, the money must be deposited back in the source account
    Problem if the source account has been closed: get humans to take care of the problem

User code executing transaction processing using 2PC does not have to deal with such failures

There are many situations where the extra effort of error handling is worth the benefit of the absence of blocking
  E.g. pretty much all transactions across organizations

Page 28: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

30

Managing Replicated Data

Issues:
  All replicas should have the same value: updates performed at all replicas
  But what if a replica is not available (disconnected, or failed)?
  What if different transactions update different replicas concurrently?
    Need some form of distributed concurrency control

Page 29: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

31

Primary Copy

Choose one replica of a data item to be the primary copy.
  The site containing that replica is called the primary site for that data item
  Different data items can have different primary sites

For concurrency control: when a transaction needs to lock a data item Q, it requests a lock at the primary site of Q
  It implicitly gets a lock on all replicas of the data item

Benefit
  Concurrency control for replicated data is handled similarly to unreplicated data: simple implementation.
Drawback
  If the primary site of Q fails, Q is inaccessible even though other sites containing a replica may be accessible.

Page 30: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

32

Primary Copy

Primary copy scheme for performing updates:
  Update at the primary; updates are subsequently replicated to the other copies
  Updates to a single item are serialized at the primary copy

Page 31: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

34

Majority Protocol for Locking

If Q is replicated at n sites, then a lock request message is sent to more than half of the n sites at which Q is stored.
The transaction does not operate on Q until it has obtained a lock on a majority of the replicas of Q.
When writing the data item, the transaction performs writes on all replicas.

Benefit
  Can be used even when some sites are unavailable
    Details on how to handle writes in the presence of site failures come later
Drawback
  Requires 2(n/2 + 1) messages for handling lock requests, and (n/2 + 1) messages for handling unlock requests.
  Potential for deadlock even with a single item: e.g., each of 3 transactions may have locks on 1/3rd of the replicas of a data item.

Page 32: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

35

Majority Protocol for Accessing Replicated Items

The majority protocol for updating replicas of an item:
  Each replica of each item has a version number, which is updated when the replica is updated, as outlined below
  A lock request is sent to more than half of the sites at which item replicas are stored, and the operation continues only when a lock is obtained on a majority of the sites
  Read operations look at all locked replicas, and read the value from the replica with the largest version number
    May write this value and version number back to replicas with lower version numbers (no need to obtain locks on all replicas for this task)

Page 33: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

36

Majority Protocol for Accessing Replicated Items

Majority protocol (cont.)
  Write operations
    Find the highest version number as reads do, and set the new version number to old highest version + 1
    Writes are then performed on all locked replicas, and the version number on these replicas is set to the new version number
    Done with 2-phase commit OR a distributed consensus protocol such as Paxos
  Failures (network and site) cause no problems as long as
    Sites at commit contain a majority of replicas of any updated data items
    During reads, a majority of replicas are available to find version numbers
  Note: reads are guaranteed to see the latest version of a data item
  Reintegration is trivial: nothing needs to be done
  (A quorum read/write sketch follows below.)
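A compact sketch of majority (quorum) reads and writes with version numbers, ignoring locking, 2PC, and failure handling; `replicas` is a hypothetical list of objects exposing read()/write() on their local copy.

    # Quorum read/write over n replicas, each holding a (version, value) pair.
    def majority_read(replicas):
        quorum = len(replicas) // 2 + 1
        responses = [r.read() for r in replicas[:quorum]]        # (version, value) pairs
        return max(responses, key=lambda vv: vv[0])              # largest version number wins

    def majority_write(replicas, value):
        quorum = len(replicas) // 2 + 1
        version, _ = majority_read(replicas)                     # find current highest version
        new_version = version + 1
        for r in replicas[:quorum]:                              # in practice: all locked replicas, atomically
            r.write(new_version, value)
        return new_version

Because any two majorities intersect, a quorum read always sees the version written by the most recent quorum write.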

Page 34: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

40

Read One Write All (Available)

Read one write all available (ignoring failed sites) is attractive, but incorrect
  A failed link may come back up without a disconnected site ever being aware that it was disconnected
    The site then has old values, and a read from that site would return an incorrect value
    If the site were aware of the failure, reintegration could have been performed, but there is no way to guarantee this
  With network partitioning, sites in each partition may update the same item concurrently, believing that the sites in the other partitions have all failed

Page 35: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

41

Replication with Weak Consistency

Many systems support replication of data with weak degrees of consistency (i.e., without a guarantee of serializability)
  i.e. with read quorum QR, write quorum QW and S sites: QR + QW <= S, or 2*QW <= S
  Usually only when not enough sites are available to ensure a quorum
  Tradeoff of consistency versus availability

Many systems support lazy propagation, where updates are transmitted after the transaction commits
  Allows updates to occur even if some sites are disconnected from the network, but at the cost of consistency

What to do if there are inconsistent concurrent updates? How to detect? How to reconcile?

Page 36: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

42

Availability

High availability: the time for which the system is not fully usable should be extremely low (e.g. 99.99% availability)
Robustness: the ability of the system to function in spite of failures of components
Failures are more likely in large distributed systems

To be robust, a distributed system must either
  Detect failures, reconfigure the system so computation may continue, and recover/reintegrate when a site or link is repaired
  OR follow protocols that guarantee consistency in spite of failures.

Failure detection: distinguishing link failure from site failure is hard
  (Partial) solution: have multiple links; multiple link failure is likely a site failure

Page 37: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

43

Reconfiguration

Reconfiguration:
  Abort all transactions that were active at a failed site
    Making them wait could interfere with other transactions, since they may hold locks on other sites
    However, in case only some replicas of a data item failed, it may be possible to continue transactions that had accessed data at a failed site (more on this later)
  If replicated data items were at the failed site, update the system catalog to remove them from the list of replicas
    This should be reversed when the failed site recovers, but additional care needs to be taken to bring values up to date
  If a failed site was a central server for some subsystem, an election must be held to determine the new server
    E.g. name server, concurrency coordinator, global deadlock detector

Page 38: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

47

Distributed Consensus

From Lamport's Paxos Made Simple:
  Assume a collection of processes that can propose values. A consensus algorithm ensures that a single one among the proposed values is chosen. If no value is proposed, then no value should be chosen. If a value has been chosen, then processes should be able to learn the chosen value.
  The safety requirements for consensus are:
    Only a value that has been proposed may be chosen,
    Only a single value is chosen, and
    A process never learns that a value has been chosen unless it actually has been.

Paxos: a family of protocols for distributed consensus

Page 39: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

48

Paxos: Overview

Three kinds of participants (a site can play more than one role)
  Proposer: proposes a value
  Acceptor: accepts (or rejects) a proposal, following a protocol
    Consensus is reached when a majority of acceptors have accepted a proposal
  Learner: finds out what value (if any) was accepted by a majority of acceptors
    An acceptor generally informs each member of a set of learners
(A minimal single-decree sketch follows below.)
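For intuition, here is a minimal single-decree Paxos sketch (acceptor state plus one proposer round). Messaging, retries, and learners are omitted; it follows the standard protocol rather than anything specific to these slides.

    # Single-decree Paxos: acceptor state and one proposer round (in-process sketch).
    class Acceptor:
        def __init__(self):
            self.promised = -1        # highest proposal number promised
            self.accepted = None      # (number, value) of highest accepted proposal, if any

        def prepare(self, n):
            # Phase 1b: promise not to accept proposals numbered below n.
            if n > self.promised:
                self.promised = n
                return ("promise", self.accepted)
            return ("reject", None)

        def accept(self, n, value):
            # Phase 2b: accept unless a higher-numbered prepare was promised.
            if n >= self.promised:
                self.promised = n
                self.accepted = (n, value)
                return "accepted"
            return "rejected"

    def propose(acceptors, n, my_value):
        majority = len(acceptors) // 2 + 1
        # Phase 1a/1b: collect promises from a majority.
        promises = [a.prepare(n) for a in acceptors]
        granted = [acc for tag, acc in promises if tag == "promise"]
        if len(granted) < majority:
            return None                                   # retry with a higher proposal number
        # Must propose the value of the highest-numbered accepted proposal, if any.
        prior = [acc for acc in granted if acc is not None]
        value = max(prior, key=lambda nv: nv[0])[1] if prior else my_value
        # Phase 2a/2b: the value is chosen once a majority accepts.
        acks = [a.accept(n, value) for a in acceptors]
        return value if acks.count("accepted") >= majority else None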

Page 40: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

Distributed File Systems

Page 41: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

53

Distributed File Systems

Google File System (GFS)

Hadoop File System (HDFS)

And older ones like CODA

Basic architecture:
  Master: responsible for metadata
  Chunk servers: responsible for reading and writing large chunks of data
  Chunks replicated on 3 machines; the master is responsible for managing replicas
  Replication is within a single data center

Page 42: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

HDFS Architecture

(Figure: a Client, the NameNode, a SecondaryNameNode, and the DataNodes.
  1. The client sends the filename to the NameNode
  2. and gets back the block IDs and DataNode locations;
  3. it then reads the data from the DataNodes.)

NameNode: maps a file to a file-id and a list of DataNodes
DataNode: maps a block-id to a physical location on disk

Page 43: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

55

Google File System (OSDI 04)

Inspiration for HDFS

Master: responsible for all metadata

Chunk servers: responsible for reading and writing large chunks of data

Chunks replicated on 3 machines

Master responsible for ensuring replicas exist

Page 44: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

56

Limitations of GFS/HDFS

Central master becomes a bottleneck
  Keeps directory/inode information in memory to avoid IO
  Memory size limits the number of files
  File system directory overheads per file
  Not appropriate for storing a very large number of objects

File systems do not provide consistency guarantees
  File systems cache blocks locally

Ideal for write-once and append-only data
Can be used as underlying storage for a data storage system
  E.g. BigTable uses GFS underneath

Page 45: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

Parallel/Distributed Data Storage Systems

Page 46: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

58

Typical Data Storage Access API

Basic API access:
  get(key) -- Extract the value given a key
  put(key, value) -- Create or update the value given its key
  delete(key) -- Remove the key and its associated value
  execute(key, operation, parameters) -- Invoke an operation on the value (given its key), which is a special data structure (e.g. List, Set, Map, etc.)

Extensions add version numbering, etc.

Page 47: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

59

Data Types in Data Storage Systems

Uninterpreted key/value, or 'the big hash table'
  Amazon S3 (Dynamo)

Flexible schema
  Ordered keys, semi-structured data
    BigTable [Google], Cassandra [Facebook/Apache], HBase [Apache, similar to BigTable]
  Unordered keys, JSON
    Sherpa/PNUTS [Yahoo], CouchDB [Apache]
  Document/textual data (with JSON variant)
    MongoDB [10gen, open source]

Page 48: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

60

E.g. of Flexible Data Model -- ColumnFamily: Inventory Details

(Figure: a column family in which each row key maps to its own set of name-value columns, so different rows can have different columns. For example:
  one row: name = Ipad (new) 16GB, Memory = 512MB, inventoryQty = 100, 3G = No
  another row: name = ASUS Laptop, Memory = 3GB, inventoryQty = 9, Screen = 15 inch
  a third row: name = All Terrain Vehicle, Power = 10HP, inventoryQty = 35, and a Wheels column)

Page 49: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

61

Bigtable: Data model

<Row, Column, Timestamp> triple for key
  Lookup (point and range)
    E.g. prefix lookup by com.cnn.* in the example
  Insert and delete API
    Allows inserting and deleting versions for any specific cell
Arbitrary "columns" on a row-by-row basis
Partly column-oriented physical store: all columns in a "column family" are stored together
(A toy sketch of this data model follows below.)
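To make the data model concrete, here is a toy sketch (not Bigtable's actual API or storage layout) of a map keyed by <row, column, timestamp> with versioned cells and prefix scans over sorted row keys; the row and column names in the usage lines echo the webtable example from the Bigtable paper.

    import bisect, time

    # Toy Bigtable-style table: each cell keeps timestamped versions,
    # with the most recent put at the front of the list.
    class ToyBigtable:
        def __init__(self):
            self.rows = {}          # row key -> {column -> [(timestamp, value), ...]}

        def put(self, row, column, value, ts=None):
            versions = self.rows.setdefault(row, {}).setdefault(column, [])
            versions.insert(0, (ts or time.time(), value))

        def get(self, row, column):
            # Return the most recently written version of the cell, if any.
            versions = self.rows.get(row, {}).get(column, [])
            return versions[0] if versions else None

        def scan_prefix(self, prefix):
            # Range lookup over sorted row keys, e.g. prefix "com.cnn." on reversed URLs.
            keys = sorted(self.rows)
            i = bisect.bisect_left(keys, prefix)
            while i < len(keys) and keys[i].startswith(prefix):
                yield keys[i], self.rows[keys[i]]
                i += 1

    t = ToyBigtable()
    t.put("com.cnn.www", "anchor:cnnsi.com", "CNN")
    print(t.get("com.cnn.www", "anchor:cnnsi.com"))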

Page 50: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

Architecture of distributed storage systems

There are many more distributed data storage systems; we will not do a full survey, but just cover some representatives
  Some of those we don't cover: Cassandra, HBase, CouchDB, MongoDB
  Some we cover later: Dynamo

Covered next: Bigtable, PNUTS/Sherpa, Megastore

Page 51: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

Bigtable

Massively parallel data storage system
  Supports key-value storage and lookup
  Incorporates flexible schema
  Designed to work within a single data center
  Does not address distributed consistency issues, etc.

Built on top of
  Chubby: distributed fault tolerant lock service
  Google File System (GFS)

Page 52: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

64

Distributing Load Across Servers

Table split into multiple tablets
  Tablet servers manage tablets, multiple tablets per server. Each tablet is 100-200 MB
    Each tablet is controlled by only one server
    A tablet server splits tablets that get too big
  Master responsible for load balancing and fault tolerance

All data and logs stored in GFS
  Leverage GFS replication/fault tolerance
  Data can be accessed, if required, from any node to aid in recovery

Page 53: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

65

Chubby (OSDI’06)

{lock/file/name} service
  5 replicas, tolerant to failures of individual replicas
    Needs a majority vote to be active
  Uses the Paxos algorithm for distributed consensus
  Coarse-grained locks; can store a small amount of data in a lock
  Used as a fault tolerant directory

In Bigtable, Chubby is used to
  locate the master node
  detect master and tablet server failures using lock leases
    More on this shortly
  choose a new master/tablet server in a safe manner using the lock service

Page 54: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

66

How To Layer Storage on Top of a File System

The file system supports large objects, but with
  Significant per-object overhead, and
  Limits on the number of objects
  No guarantees of consistency if updates/reads are made at different nodes

How to build a small-object store on top of the file system?
  Aggregate multiple small objects into a file (SSTable)

SSTable: immutable sorted file of key-value pairs
  Chunks of data plus an index
  The index is of block ranges, not values

(Figure: an SSTable consists of 64K data blocks plus an index.)

Page 55: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

67

Tablets: Partitioning Data across Nodes

Tablet: unit of partitioning
  Contains some range of rows of the table
  Built out of multiple SSTables
    Key ranges of SSTables can overlap
    Lookups access all SSTables (see the sketch below)
  One of the SSTables is in memory (memtable)
    Inserts are done on the memtable

(Figure: a tablet covering the key range Start:aardvark to End:apple, built from multiple SSTables, each consisting of 64K blocks plus an index.)
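A rough sketch of that read path, under simplifying assumptions (SSTables are plain dicts, no block index, no deletion markers): the memtable is checked first, then each SSTable from newest to oldest.

    # Toy tablet read path: recent writes in the in-memory memtable, older data
    # in a list of immutable SSTables (here just dicts), newest first.
    class ToyTablet:
        def __init__(self, sstables):
            self.memtable = {}               # recent writes
            self.sstables = sstables         # list of dicts, newest first

        def put(self, key, value):
            self.memtable[key] = value       # inserts/updates only touch the memtable

        def get(self, key):
            if key in self.memtable:         # check the memtable first
                return self.memtable[key]
            for sst in self.sstables:        # then each SSTable, newest to oldest
                if key in sst:
                    return sst[key]
            return None

        def minor_compaction(self):
            # Freeze the memtable into a new (immutable) SSTable.
            self.sstables.insert(0, dict(self.memtable))
            self.memtable = {}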

Page 56: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

68

Immutability

SSTables are immutable
  Simplifies caching, sharing across GFS, etc.: no need for concurrency control
  SSTables of a tablet are recorded in the METADATA table
  Garbage collection of SSTables is done by the master
  On a tablet split, the split tablets can start off quickly on shared SSTables, splitting them lazily

Only the memtable has concurrent reads and updates
  Copy-on-write of rows allows concurrent read/write

Page 57: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

69

Editing a table

Mutations are logged, then applied to an in-memory memtable
  May contain "deletion" entries to handle updates
  Lookups match up relevant deletion entries to filter out logically deleted data
Periodic compaction merges multiple SSTables and removes logically deleted entries

(Figure: inserts and deletes are appended to the tablet log and applied to the memtable, which lives in memory; the tablet log and the tablet's SSTables are stored in GFS.)

Page 58: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

70

Table

Multiple tablets make up the table
  Tablet ranges do not overlap
  A directory provides the key-range to tablet mapping

Each tablet is assigned to one server (tablet server)
  But the data is in GFS; replication is handled by GFS
  If a tablet is reassigned, the new tablet server simply fetches its data from GFS
All updates/lookups on a tablet are routed through its tablet server
  Primary copy scheme

Page 59: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

71

Finding a tablet

Directory entries:
  Key: table id + end row; Data: location
  Cached at clients, which may detect the data to be incorrect
    In which case a lookup on the hierarchy is performed
  Also prefetched (for range queries)
Identifies the tablet; the server is found separately via the master

Page 60: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

72

Master’s Tasks

Use Chubby to monitor the health of tablet servers, restart failed servers
  A tablet server registers itself by getting a lock in a specific Chubby directory
    Chubby gives a "lease" on the lock, which must be renewed periodically
    The server loses the lock if it gets disconnected
  The master monitors this directory to find which servers exist/are alive
    If a server is not contactable/has lost its lock, the master grabs the lock and reassigns its tablets
  GFS replicates the data. Prefer to start a tablet server on the same machine that the data is already at

Page 61: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

73

Master’s Tasks (Cont)

When a (new) master starts, it
  grabs the master lock on Chubby
    Ensures only one master at a time
  finds live servers (scans the Chubby directory)
  communicates with servers to find assigned tablets
  scans the metadata table to find all tablets
    Keeps track of unassigned tablets, and assigns them
    The metadata root is obtained from Chubby; other metadata tablets are assigned before scanning

Page 62: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

74

Metadata Management

The master handles table creation, and merging of tablets

Tablet servers directly update metadata on a tablet split, then notify the master
  A lost notification may be detected lazily by the master

Page 63: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

79

Transactions in Bigtable

Bigtable supports only single-row transactions
  Atomic read-modify-write sequences on data stored under a single row key
  No transactions across sites

Limitations are many
  E.g. secondary indices cannot be supported consistently

PNUTS provides a different approach, based on persistent messaging

Page 64: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

Yahoo PNUTS (a.k.a. Sherpa)

Goals similar to Bigtable, but some significant architectural differences
  Not layered on a distributed file system; handles distribution and replication by itself
    A master node takes care of the above tasks
    The master node is replicated using master-slave backup
      Assumption: joint failure of the master node and its replica is unlikely
      Similar to the GFS assumption about the master
  Can use existing storage systems such as MySQL or BerkeleyDB underneath
  Focus on geographic replication

Page 65: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

Detailed architecture

(Figure: PNUTS data-path components -- clients, a REST API, routers, the tablet controller, storage units, and the message broker.)

This slide from Brian Cooper’s talk at VLDB 2008

Page 66: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

Detailed architecture

(Figure: the local region contains clients, a REST API, routers, a tablet controller, and storage units; updates are carried by YMB to remote regions.)

This slide from Brian Cooper’s talk at VLDB 2008

Page 67: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

83

Yahoo Message Bus

Distributed publish-subscribe service
  Guarantees delivery once a message is published
    Logging at the site where the message is published, and at other sites when received
  Guarantees that messages published to a particular cluster will be delivered in the same order at all other clusters

Record updates are published to YMB by the master copy
All replicas subscribe to the updates, and get them in the same order for a particular record

This slide from Brian Cooper’s talk at VLDB 2008

Page 68: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

84

Managing Concurrency

Managing concurrency is now the programmer's job
  Updates are sent lazily to replicas
    Update replica 1, read replica 2 can give an old value
  Replication is no longer transparent

To deal with this, PNUTS provides version numbering
  Reads can specify a version number
  Writes can be conditional on a version number
  The application programmer has to manage concurrency (a usage sketch follows below)

With great scale comes great responsibility …
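A sketch of how an application might use version-aware calls of the kind PNUTS exposes (read-critical and test-and-set-write appear later in this talk); the `client` object, its method signatures, and the error type here are hypothetical.

    class VersionMismatchError(Exception):
        pass   # hypothetical error raised when the required version check fails

    # Increment a counter in a record, retrying if a concurrent writer won.
    def increment_counter(client, key):
        while True:
            version, record = client.read_latest(key)          # current version + value
            record["count"] = record.get("count", 0) + 1
            try:
                # The write succeeds only if the record is still at `version`.
                client.test_and_set_write(key, record, required_version=version)
                return
            except VersionMismatchError:
                continue                                        # someone else updated; retry

    # A reader who just wrote version v can insist on seeing at least that version:
    #     client.read_critical(key, required_version=v)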

Page 69: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

85

Consistency model

Goal: make it easier for applications to reason about updates and cope with asynchrony

What happens to a record with primary key "Brian"?

(Figure: a timeline for the record. It is inserted, creating v. 1; each subsequent update creates a new version, v. 2 through v. 8, within generation 1; a delete ends the generation.)

This slide from Brian Cooper’s talk at VLDB 2008

Page 70: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

86

Consistency model

(Figure: the same timeline, v. 1 through v. 8 of generation 1. The latest version is the current version; earlier versions are stale. A plain read may return a stale version.)

This slide from Brian Cooper’s talk at VLDB 2008

Page 71: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

87

Consistency model

(Figure: same timeline. A "read up-to-date" request returns the current version rather than a stale one.)

This slide from Brian Cooper’s talk at VLDB 2008

Page 72: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

88

Consistency model

Read-critical(required version):
(Figure: same timeline. A request "read ≥ v.6" may be served by v.6 or any later version, but never by anything older than the required version.)

This slide from Brian Cooper’s talk at VLDB 2008

Page 73: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

89

Consistency model

(Figure: same timeline. A write applied to the current version produces the next version of the record.)

This slide from Brian Cooper’s talk at VLDB 2008

Page 74: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

90

Consistency model

Test-and-set-write(required version):
(Figure: same timeline. A request "write if = v.7" returns an ERROR because the current version is already v.8, i.e. the record has changed since the writer read v.7.)

This slide from Brian Cooper’s talk at VLDB 2008

Page 75: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

91

Consistency model

(Figure: the same test-and-set-write example. The mechanism behind this consistency model is per-record mastership: all updates to a record are applied in order at its master copy.)

This slide from Brian Cooper’s talk at VLDB 2008

Page 76: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

92

Record and Tablet Mastership

Data in PNUTS is replicated across sites
A hidden field in each record stores which copy is the master copy
  Updates can be submitted to any copy
    They are forwarded to the master, and applied in the order received by the master
  The record also contains the origin of the last few updates
    Mastership can be changed by the current master, based on this information
    A mastership change is simply a record update

Tablet mastership
  Required to ensure primary key consistency
  Can be different from record mastership

This slide from Brian Cooper’s talk at VLDB 2008

Page 77: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

93

Other Features

Per-record transactions

Copying a tablet (e.g. on failure)
  Request a copy
  Publish a checkpoint message
  Get a copy of the tablet as of when the checkpoint is received
  Apply later updates

Tablet split
  Has to be coordinated across all copies

This slide from Brian Cooper’s talk at VLDB 2008

Page 78: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

Asynchronous View Maintenance in PNUTS

Queries involving joins/aggregations are very expensive
Alternative: materialized views that are maintained on each update
  Small (constant) amount of work per update

Asynchronous view maintenance
  To avoid imposing latency on the updater
  Readers may get an out-of-date view
  Readers can ask for the view state after a certain update has been applied

P. Agrawal et al., SIGMOD 2009

Page 79: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

95

Asynchronous View Maintenance in PNUTS

Two kinds of views:
  Remote view
    Basically just a repartitioning of the data on a different key
  Local view
    Can aggregate local tuples
    Tuples of each table are partitioned on some partitioning key
    Maintained synchronously with updates

Can layer views to express more complex queries
No join views
  But can co-locate tuples from two relations based on the join key
  And can compute the join locally on demand

Page 80: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

96

Asynchronous View Maintenance in PNUTS (from Agrawal, SIGMOD 09)

Page 81: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

97

Remote View Maintenance (from Agrawal, SIGMOD 09)

(Figure: clients issue requests through an API to query routers and storage servers, with log managers and a remote maintainer behind them. Steps: a. disk write in the log; b. cache write in storage; c. return to user; d. flush to disk; e. remote view maintenance; f. view log message(s); g. view update(s) in storage.)

Page 82: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

98

Aggregates by scatter-gather (from Agrawal, SIGMOD 09)

(Figure: the table Reviews(review-id, user, item, text), e.g. (1, Dave, TV), (2, Jack, GPS), (7, Dave, GPS), (8, Tim, TV), (4, Alex, DVD), (5, Jack, TV), is spread over several servers. Each server keeps a local view of per-item counts (GPS 1, TV 1; DVD 1, TV 1; GPS 1, TV 1), and a query such as "number of TV reviews" is answered by scatter-gather over these local views. Avoids expensive scans; simple materialized views.)

Page 83: Massively Parallel/Distributed Data Storage Systems

+

IBM ICARE Winter School on Big Data, Oct. 2012

Local Aggregates over Repartitioned Data (from Agrawal, SIGMOD 09)

(Figure: the same Reviews(review-id, user, item, text) table is repartitioned into a remote view ByItem(item, review-id, user, text), so all reviews of an item are co-located, e.g. the TV reviews (1 Dave, 5 Jack, 8 Tim) on one partition. Per-item counts (DVD 1, GPS 2, TV 3) then become local aggregates, and "number of TV reviews" is answered locally.)

Page 84: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

100

Consistency Model for Views

Base consistency model: the timelines of all records are independent
Multiple records of the views are connected to base records
  Information in a view record vr comes from a base record r
    vr is dependent on r, while r is incident on vr
  Indexes, selections, equi-joins: one to one
  Group by/aggregates: many to one

Goal: ensure that a view record read is consistent with some update which has taken place earlier
  A request specifying a version blocks until the update is reflected in the view
  How to implement?

Page 85: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

101

Reading View Records Based on Versions

An LVT (local view table) is always up-to-date with respect to the local replica of the underlying base table
  A version request is mapped to the underlying base table

For RVTs (remote view tables):
  Single record reads for one-to-one views:
    Each view record has a single incident base record
    View records can use the version no. of the base record
    Consistency guarantees are inherited from the base table reads: ReadAny(vr), ReadCritical(vr, v'), ReadLatest(vr)

Page 86: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

102

Reading View Records Based on Versions

Single record reads for many-to-one views:
  Multiple base records br are incident on a single view record vr
  ReadAny: reads any version of the base records
  ReadCritical: needs specific versions for certain subsets of base records, and ReadAny for all other base records
    E.g. when a user updates some tuples and then reads back, he would like to see the latest version of the updated records; relatively stale other records may be acceptable
    Base record versions are available in the RVT/base table on which the view is defined
  ReadLatest: accesses the base table; the high cost is unavoidable

Page 87: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

Google’s Megastore

Layer on top of Bigtable

Provides geographic replication

Key feature: the notion of an entity group
  Provides full ACID properties for transactions which access only a single entity group
  Plus support for 2PC/persistent messaging for transactions that span entity groups

Use of Paxos for distributed consensus
  For atomic commit across replicas

Page 88: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

104

Megastore: Motivation

Many applications need transactional support beyond individual rows

Entity group: a set of related rows
  E.g. the records of a single user, or of a group
  ACID transactions are supported within an entity group

Some transactions need to cross entity groups
  E.g. bidirectional relationships
  Use 2PC or persistent messaging

Data structures spanning multiple entity groups
  E.g. secondary indices

Page 89: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

105

Entity Group Example (Baker et al., CIDR 11)

CREATE TABLE User {
  required int64 user_id;
  required string name;
} PRIMARY KEY(user_id), ENTITY GROUP ROOT;

CREATE TABLE Photo {
  required int64 user_id;
  required int32 photo_id;
  required int64 time;
  required string full_url;
  optional string thumbnail_url;
  repeated string tag;
} PRIMARY KEY(user_id, photo_id),
  IN TABLE User,
  ENTITY GROUP KEY(user_id) REFERENCES User;

CREATE GLOBAL INDEX PhotosByTag ON Photo(tag) STORING (thumbnail_url);

Page 90: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

106

Megastore Architecture (from Baker et al, CIDR 11)

Page 91: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

107

Megastore: Operations across Entity Groups (from Baker et al, CIDR 11)

Page 92: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

108

Multiversioning in Megastore

Megastore implements multiversion concurrency control (MVCC) within an entity group
  When mutations within a transaction are applied, the values are written at the timestamp of their transaction
  Readers use the timestamp of the last fully applied transaction to avoid seeing partial updates

Snapshot isolation
  Updates are applied after commit
  A current read applies all committed updates, and returns the current value
  Readers and writers don't block each other (a toy sketch follows below)
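A toy illustration (not Megastore's actual implementation) of timestamped versions with reads taken at the last fully applied transaction:

    # Toy MVCC store for one entity group: each key maps to a list of
    # (commit_timestamp, value) versions; readers never see partial updates.
    class MVCCEntityGroup:
        def __init__(self):
            self.versions = {}           # key -> [(timestamp, value), ...] in commit order
            self.last_applied = 0        # timestamp of the last fully applied transaction

        def read(self, key, at=None):
            # Read as of the given timestamp (default: last fully applied transaction).
            ts = at if at is not None else self.last_applied
            candidates = [(t, v) for t, v in self.versions.get(key, []) if t <= ts]
            return candidates[-1][1] if candidates else None

        def apply_transaction(self, timestamp, writes):
            # Write all mutations at the transaction's timestamp, then expose them.
            for key, value in writes.items():
                self.versions.setdefault(key, []).append((timestamp, value))
            self.last_applied = timestamp    # readers now see this transaction in full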

Page 93: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

109

Megastore Replication Architecture (Baker et al. CIDR 11)

Page 94: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

110

Managing Replicated Data in Megastore

Options
  Master/slave replication: need to detect failure of the master, and initiate transfer of mastership to a new site
    Not an issue within a data center (e.g. GFS or Bigtable)
    Consensus protocols such as Paxos are needed to transfer mastership
  Megastore decided to go with Paxos for managing data updates, not just for deciding the master/coordinator

The application can control placement of entity-group replicas
  Usually 3 to 4 data centers near the place where the entity group is accessed

Page 95: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

111

Paxos in Megastore

Key idea: replicate a write-ahead log over a group of symmetric peers
  (An optimized version of) Paxos is used to append log records to the log
  Each log append blocks on acknowledgments from a majority of replicas
    Replicas in the minority catch up as they are able
  Paxos' inherent fault tolerance eliminates the need for a distinguished "failed" state

One log per entity group
  Stored in the Bigtable row for the root of the entity group
  All log records of a transaction are written at once
  Sequence number for an entity group (log position)
    Read at the beginning of the txn; add 1 to get the new sequence number
    Concurrent transactions on the same entity group will generate the same sequence number: only one succeeds in the Paxos commit phase

Page 96: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

114

Performance of Megastore

Optimized Paxos was found to perform well
Most reads are local
  Thanks to the coordinator optimization, and
  the fact that most writes succeed on all replicas

Page 97: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

115

Other (recent) stuff from Google 1: Spanner

Descendant of Bigtable, successor to Megastore
  Differences: hierarchical directories instead of rows, fine-grained replication; replication configuration at the per-directory level

Globally distributed
  Replicas spread across the country to survive regional disasters
  Synchronous cross-datacenter replication
Transparent sharding, data movement
General transactions
  Multiple reads followed by a single atomic write
  Local or cross-machine (using 2PC)
Snapshot reads

Page 98: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

116

Other (recent) stuff from Google 2: F1 Scalable Relational Storage

F1: Fault-Tolerant Distributed RDBMS Supporting Google's Ad Business (Shute et al., SIGMOD 2012)

Hybrid combining
  Scalability of Bigtable
  Usability and functionality of SQL databases

Key ideas
  Scalability: auto-sharded storage
  Availability & consistency: synchronous replication, consistent indices
  High commit latency can be hidden
  Hierarchical schema and data storage

Page 99: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

117

Queries and Transactions in F1

Preferred transaction structure
  One read phase: no serial reads
    Read in batches
    Read asynchronously in parallel
  Buffer writes in the client, send as one RPC

Client library with lightweight object "relational" mapping

Page 100: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

CAP Theorem: Consistency vs. Availability

Basic principles

Amazon Dynamo

Page 101: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

125

What is Consistency?

Consistency in databases (ACID):
  The database has a set of integrity constraints
  A consistent database state is one where all integrity constraints are satisfied
  Each transaction run individually on a consistent database state must leave the database in a consistent state

Consistency in distributed systems with replication
  Strong consistency: a schedule with read and write operations on an object should give results and a final state equivalent to some schedule on a single copy of the object, with the order of operations from a single site preserved
  Weak consistency (several forms)

Page 102: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

126

Availability

Traditionally, availability of a centralized server
For distributed systems: availability of the system to process requests
  For a large system, at almost any point in time there's a good chance that a node is down, or even a network partition

Distributed consensus algorithms will block during partitions to ensure consistency
Many applications require continued operation even during a network partition
  Even at the cost of consistency

Page 103: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

127

Brewer’s CAP Theorem

Three properties of a system
  Consistency (all copies have the same value)
  Availability (the system can run even if parts have failed)
    Via replication
  Partitions (the network can break into two or more parts, each with active systems that can't talk to the other parts)

Brewer's CAP "Theorem": you can have at most two of these three properties for any system
  Very large systems will partition at some point
    Choose one of consistency or availability
    Traditional databases choose consistency
    Most Web applications choose availability
      Except for specific parts such as order processing

Page 104: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

128

Replication with Weak Consistency

Many systems support replication of data with weak degrees of consistency (i.e., without a guarantee of serializability)
  i.e. with read quorum QR, write quorum QW and S sites: QR + QW <= S, or 2*QW <= S
    (E.g. with S = 3 replicas, QR = QW = 2 guarantees that quorums overlap, while QR = QW = 1 does not.)
  Usually only when not enough sites are available to ensure a quorum
  But sometimes to allow fast local reads
Tradeoff of consistency versus availability or latency

Key issues:
  Reads may get old versions
  Writes may occur in parallel, leading to inconsistent versions
  Question: how to detect, and how to resolve
    Will see in detail shortly

Page 105: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

129

Eventual Consistency

When no updates occur for a long period of time, eventually all updates will propagate through the system and all the nodes will be consistent

For a given accepted update and a given node, eventually either the update reaches the node or the node is removed from service

Known as BASE (Basically Available, Soft state, Eventual consistency), as opposed to ACID
  Soft state: copies of a data item may be inconsistent
  Eventually consistent: copies become consistent at some later time if there are no more updates to that data item

Page 106: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

130

Availability vs Latency

Abadi's classification system: PACELC
  The CAP theorem only matters when there is a partition
  Even if partitions are rare, applications may trade off consistency for latency
    E.g. PNUTS allows inconsistent reads to reduce latency
      Critical for many applications
    But its update protocol (via a master) ensures consistency over availability

Thus Abadi asks two questions:
  If there is Partitioning, how does the system trade off Availability for Consistency?
  Else, how does the system trade off Latency for Consistency?
  E.g. Megastore: PC/EC; PNUTS: PC/EL; Dynamo (by default): PA/EL

Page 107: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

Amazon Dynamo

Distributed data storage system supporting very high availability
  Even at the cost of consistency
  E.g. motivation from Amazon: Web users should always be able to add items to their cart
    Even if they are connected to an app server which is now in a minority partition
    Data should be synchronized with the majority partition eventually
    Inconsistency may be visible (briefly) to users: preferable to losing a customer!

DynamoDB: part of Amazon Web Services; can subscribe and use over the Web

Page 108: Massively Parallel/Distributed Data Storage Systems

IBM ICARE Winter School on Big Data, Oct. 2012

+

132

Dynamo: Basics

Provides a key-value store with a basic get/put interface
  Data values are entirely uninterpreted by the system
    Unlike Bigtable, PNUTS, Megastore, etc.

Underlying storage based on DHTs using consistent hashing with virtual processors (see the sketch below)

Replication (N-ary)
  Data stored in the node to which the key is mapped, as well as the N-1 consecutive successors in the ring
  Replication at the level of a key range (virtual node)
  A put call may return before data has been stored on all replicas
    Reduces latency, at the risk of consistency
  The programmer can control the degree of consistency (QR, QW and S) per instance (relation)
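A sketch of consistent hashing with virtual nodes (the slide's "virtual processors") and an N-node preference list; node names and parameters are made up, and failure handling and sloppy quorums are ignored.

    import hashlib, bisect

    # Consistent hashing ring with virtual nodes; each key is stored on the node
    # owning its position plus the next N-1 distinct successor nodes on the ring.
    class Ring:
        def __init__(self, nodes, vnodes=8, n_replicas=3):
            self.n_replicas = n_replicas
            self.ring = sorted(
                (self._hash(f"{node}#{i}"), node)
                for node in nodes for i in range(vnodes))

        @staticmethod
        def _hash(s):
            return int(hashlib.md5(s.encode()).hexdigest(), 16)

        def preference_list(self, key):
            # Walk clockwise from the key's position, collecting distinct nodes.
            start = bisect.bisect(self.ring, (self._hash(key),))
            nodes = []
            for i in range(len(self.ring)):
                _, node = self.ring[(start + i) % len(self.ring)]
                if node not in nodes:
                    nodes.append(node)
                if len(nodes) == self.n_replicas:
                    break
            return nodes

    ring = Ring(["A", "B", "C", "D"])
    print(ring.preference_list("cart:42"))   # three distinct replica nodes for this key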

Page 109: Massively Parallel/Distributed Data Storage Systems


Dynamo: Versioning of Data

- Data items are versioned
  - Each update creates a new immutable version
- In the absence of failures, there is a single latest version
- But with failures, versions can diverge
  - Need to detect, and fix, such situations
- Key idea: a vector clock identifies each data version
  - Set of (node, counter) pairs
  - Defines a partial order across versions
  - A read operation may find incomparable versions
    - All such versions are returned by the read operation
    - Up to the application to reconcile multiple versions

Page 110: Massively Parallel/Distributed Data Storage Systems


Vector Clocks for Versioning

- Examples of vector clocks:
  - ([Sx,1]): data item created by site Sx
  - ([Sx,2]): data item created by site Sx, and later updated
  - ([Sx,2],[Sy,1]): data item updated twice by site Sx and once by Sy
- An update by a site Sx increments the counter for Sx, but leaves counters from other sites unchanged
  - ([Sx,4],[Sy,1]) is newer than ([Sx,3],[Sy,1])
  - But ([Sx,2],[Sy,1]) is incomparable with ([Sx,1],[Sy,2])
- A practical issue: vector clocks can become long
  - Updates are handled by one of the sites containing a replica
  - Timestamps are used to prune very old members of the vector, whose updates are presumed to have reached all replicas
- Vector clocks have been around for a long while; they were not invented by Dynamo
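
The partial order described above is easy to express in code. Below is a small illustrative sketch, not Dynamo's implementation: vector clocks are plain dictionaries mapping a site id to its counter, and the site names mirror the examples on this slide.

    def increment(clock, site):
        """Record an update performed at `site`: bump its counter, leave others unchanged."""
        new = dict(clock)
        new[site] = new.get(site, 0) + 1
        return new

    def descends(a, b):
        """True if version a is at least as new as b (a dominates b componentwise)."""
        return all(a.get(site, 0) >= count for site, count in b.items())

    def compare(a, b):
        if descends(a, b) and descends(b, a):
            return "equal"
        if descends(a, b):
            return "a newer"
        if descends(b, a):
            return "b newer"
        return "incomparable"   # concurrent versions: the application must reconcile

    # The examples from the slide:
    v1 = {"Sx": 1}                       # created by Sx
    v2 = increment(v1, "Sx")             # {"Sx": 2}
    v3 = increment(v2, "Sy")             # {"Sx": 2, "Sy": 1}
    print(compare(v3, v1))                                    # a newer
    print(compare({"Sx": 4, "Sy": 1}, {"Sx": 3, "Sy": 1}))    # a newer
    print(compare({"Sx": 2, "Sy": 1}, {"Sx": 1, "Sy": 2}))    # incomparable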

Page 111: Massively Parallel/Distributed Data Storage Systems


Example of Vector Clock (from DeCandia et al., SOSP 2007)

1. Item created by site Sx
2. Item updated by site Sx
3. Item concurrently updated by sites Sy and Sz (usually due to network partitioning)
4. Read at Sx returns two incomparable versions
5. Application merges the versions and writes a new, reconciled version

Page 112: Massively Parallel/Distributed Data Storage Systems


Performing Put/Get Operations

- Get/put requests are handled by a coordinator (one of the nodes containing a replica of the item)
- Upon receiving a put() request for a key, the coordinator
  - generates the vector clock for the new version and writes the new version locally
  - then sends the new version (along with the new vector clock) to the N highest-ranked reachable nodes
  - if at least QW-1 nodes respond, the write is considered successful
- For a get() request, the coordinator
  - requests all existing versions of the data for that key from the N highest-ranked reachable nodes in the preference list for that key
  - waits for QR responses before returning the result to the client
  - returns all causally unrelated (incomparable) versions
  - the application should reconcile divergent versions and write back a reconciled version superseding the current versions
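
The following sketch shows the shape of the coordinator logic described above. It is illustrative only: the replica objects with store()/fetch() methods, the ConnectionError standing in for unreachable nodes, and the parameter names are all assumptions; descends() is the vector-clock helper from the earlier sketch, and for simplicity the coordinator is treated as just another replica rather than writing locally first.

    class QuorumCoordinator:
        """Illustrative sketch of quorum-style put/get, not Dynamo's actual code.
        `replicas` are the N highest-ranked reachable nodes for the key, each with
        store(key, clock, value) and fetch(key) methods (assumed interfaces)."""

        def __init__(self, site_id, replicas, qw, qr):
            self.site_id = site_id
            self.replicas = replicas
            self.qw = qw      # write quorum
            self.qr = qr      # read quorum

        def put(self, key, value, context_clock):
            # New version's vector clock: bump this coordinator's own counter.
            clock = dict(context_clock)
            clock[self.site_id] = clock.get(self.site_id, 0) + 1
            acks = 0
            for replica in self.replicas:
                try:
                    replica.store(key, clock, value)
                    acks += 1
                except ConnectionError:
                    pass              # unreachable replica; real systems use hinted handoff
            return acks >= self.qw    # write succeeds once a write quorum acknowledges

        def get(self, key):
            versions = []             # list of (clock, value) pairs
            responses = 0
            for replica in self.replicas:
                try:
                    versions.extend(replica.fetch(key))
                    responses += 1
                except ConnectionError:
                    pass
                if responses >= self.qr:      # wait for QR responses
                    break
            # Keep only versions not strictly dominated by another: these are the
            # causally unrelated versions that the application must reconcile.
            def strictly_newer(a, b):
                return descends(a, b) and not descends(b, a)
            return [v for v in versions
                    if not any(strictly_newer(w[0], v[0]) for w in versions)]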

Page 113: Massively Parallel/Distributed Data Storage Systems


How to Reconcile Inconsistent Versions?

- Reconciliation is application specific
  - E.g. two sites concurrently insert items into the cart: the merge adds both items to the final cart state
  - E.g. S1 adds item A, S2 deletes item B: the merge adds item A, but the deleted item B resurfaces
    - Cannot distinguish "S2 deleted B" from "S1 added B"
    - Problem: operations are inferred from the states of the divergent versions
- Better solution (not supported in Dynamo): keep track of the history of operations
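
A toy example of the two approaches above (sets of made-up item names; not Dynamo code): a state-based union merge resurrects the deleted item, while keeping a history of operations does not.

    # Naive state-based reconciliation: union of the divergent cart states.
    base = {"A", "B"}            # common ancestor version of the cart
    v1 = base | {"C"}            # S1 added item C
    v2 = base - {"B"}            # S2 deleted item B
    merged = v1 | v2             # cannot tell "S2 deleted B" from "S1 added B"
    print(merged)                # {'A', 'B', 'C'}: the deleted item B resurfaces

    # Keeping a history of operations instead avoids the ambiguity:
    ops = [("add", "C"), ("delete", "B")]    # operations gathered from both sites
    cart = set(base)
    for op, item in ops:
        if op == "add":
            cart.add(item)
        else:
            cart.discard(item)
    print(cart)                  # {'A', 'C'}: the delete is preserved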

Page 114: Massively Parallel/Distributed Data Storage Systems


Dealing with Failure

- Hinted handoff:
  - If a replica is not available at the time of a write, the write is performed on another (live) node
  - Write operations thus do not fail due to temporary node unavailability
  - Risk of divergence of data values due to the above
- To detect divergence:
  - Each node maintains a separate Merkle tree for each key range (the set of keys covered by a virtual node) it hosts
  - Nodes exchange the roots of their Merkle trees with other replicas to detect any divergence
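
Below is a minimal sketch of the Merkle-tree idea for a key range: hash each (key, value) pair in key order, then repeatedly hash pairs of children up to a single root, so two replicas can compare just their roots to detect divergence. The helper names and sample data are invented for the example.

    import hashlib

    def h(data):
        return hashlib.sha256(data).hexdigest()

    def merkle_root(items):
        """Illustrative Merkle tree over a key range: leaves are hashes of (key, value)
        pairs in key order; internal nodes hash the concatenation of their children."""
        level = [h(f"{k}={v}".encode()) for k, v in sorted(items.items())]
        if not level:
            return h(b"empty")
        while len(level) > 1:
            if len(level) % 2 == 1:            # duplicate the last hash on odd levels
                level.append(level[-1])
            level = [h((level[i] + level[i + 1]).encode())
                     for i in range(0, len(level), 2)]
        return level[0]

    # Two replicas of the same key range: comparing just the roots detects divergence.
    replica1 = {"k1": "a", "k2": "b", "k3": "c"}
    replica2 = {"k1": "a", "k2": "b'", "k3": "c"}   # diverged after a hinted handoff
    print(merkle_root(replica1) == merkle_root(replica2))   # False -> synchronize subtrees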

Page 115: Massively Parallel/Distributed Data Storage Systems


Consistency Models for Applications

- Read-your-writes
  - If a process has performed a write, a subsequent read will reflect the earlier write operation
- Session consistency
  - Read-your-writes in the context of a session, within which the application connects to the storage system
- Monotonic consistency
  - For reads: later reads never return an older version than earlier reads
  - For writes: serializes writes by a single process
- Minimum requirement

Page 116: Massively Parallel/Distributed Data Storage Systems


Implementing Consistency Models

- Sticky sessions: all operations from a session on a data item go to the same node
- Get operations can specify a version vector
  - The result of the get is guaranteed to be at least as new as the specified version vector
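
The sketch below illustrates how a session wrapper could use version vectors to provide read-your-writes and monotonic reads. The store.put/store.get signatures (including the min_version argument) are assumptions for illustration, not a real system's API; merge() is the componentwise maximum of two version vectors, represented as dicts as in the earlier sketches.

    def merge(a, b):
        """Componentwise maximum of two version vectors (dicts of site -> counter)."""
        return {site: max(a.get(site, 0), b.get(site, 0)) for site in set(a) | set(b)}

    class Session:
        """Illustrative sketch of session guarantees over a get/put store that
        understands version vectors (assumed interface)."""

        def __init__(self, store):
            self.store = store
            self.last_seen = {}      # highest version vector observed in this session

        def put(self, key, value):
            new_clock = self.store.put(key, value, context=self.last_seen)
            self.last_seen = merge(self.last_seen, new_clock)     # read-your-writes

        def get(self, key):
            # Ask for a version at least as new as anything seen in this session
            # (monotonic reads); the store redirects/waits until it can satisfy this.
            clock, value = self.store.get(key, min_version=self.last_seen)
            self.last_seen = merge(self.last_seen, clock)
            return value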

Page 117: Massively Parallel/Distributed Data Storage Systems


Conclusions, Future Work, and References

Page 118: Massively Parallel/Distributed Data Storage Systems


Conclusions

- A new generation of massively parallel/distributed data storage systems has been developed to address issues of scale, availability and latency in extremely large scale Web applications
- Revisits old issues in terms of replication, consistency and distributed consensus
- Some new technical ideas, but the key contributions are in terms of engineering at scale
- Developers need to be aware of
  - Tradeoffs between consistency, availability and latency
  - How to choose between these for different parts of their applications
  - And how to enforce their choice on the data storage platform
- The good old days of the database ensuring ACID properties are gone!
  - But life is a lot more interesting now

Page 119: Massively Parallel/Distributed Data Storage Systems


Future Work

- We are at the beginning of a new wave
  - Programmers are expected to open the hood and get their hands dirty
- Programming models to deal with different consistency levels for different parts of an application
  - Hide consistency issues as much as possible from the programmer
- Need better support for logical operations to allow resolution of conflicting updates
  - At the system level
  - At the programmer level, to reason about application consistency
    - E.g. the relationship between Consistency And Logical Monotonicity ("CALM principle"): monotonic programs guarantee eventual consistency under any interleaving of delivery and computation
- Tools for testing/verification

Page 120: Massively Parallel/Distributed Data Storage Systems


Some References

- D. Abadi, Consistency Tradeoffs in Modern Distributed Database System Design, IEEE Computer, Feb. 2012
- P. Agrawal et al., Asynchronous View Maintenance for VLSD Databases, SIGMOD 2009
- P. Alvaro et al., Consistency Analysis in Bloom: a CALM and Collected Approach, CIDR 2011
- J. Baker et al., Megastore: Providing Scalable, Highly Available Storage for Interactive Services, CIDR 2011
- F. Chang et al., Bigtable: A Distributed Storage System for Structured Data, OSDI 2006
- G. DeCandia et al., Dynamo: Amazon's Highly Available Key-value Store, SOSP 2007
- S. Das et al., G-Store: A Scalable Data Store for Transactional Multi-key Access in the Cloud, SoCC 2010
- S. Finkelstein et al., Transactional Intent, CIDR 2011
- S. Gilbert and N. Lynch, Perspectives on the CAP Theorem, IEEE Computer, Feb. 2012
- L. Lamport, Paxos Made Simple, ACM SIGACT News (Distributed Computing Column) 32(4), 2001
- W. Vogels, Eventually Consistent, CACM 52(1), Jan. 2009

Page 121: Massively Parallel/Distributed Data Storage Systems


END OF TALK

Page 122: Massively Parallel/Distributed Data Storage Systems


Figure 19.02

Page 123: Massively Parallel/Distributed Data Storage Systems


Figure 19.03

Page 124: Massively Parallel/Distributed Data Storage Systems


Figure 19.07

Page 125: Massively Parallel/Distributed Data Storage Systems


Bully Algorithm

- If site Si sends a request that is not answered by the coordinator within a time interval T, assume that the coordinator has failed; Si tries to elect itself as the new coordinator
- Si sends an election message to every site with a higher identification number; Si then waits for any of these sites to answer within T
- If no response arrives within T, assume that all sites with numbers greater than i have failed; Si elects itself the new coordinator
- If an answer is received, Si begins a time interval T', waiting to receive a message that a site with a higher identification number has been elected

Page 126: Massively Parallel/Distributed Data Storage Systems


Bully Algorithm (Cont.)

- If no message is received within T', assume the site with a higher number has failed; Si restarts the algorithm
- After a failed site recovers, it immediately begins execution of the same algorithm
- If there are no active sites with higher numbers, the recovered site forces all sites with lower numbers to let it become the coordinator site, even if there is a currently active coordinator with a lower number (a sketch of the election logic follows below)
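
A toy sketch of the election outcome, abstracting away real messages and timeouts: live_sites stands in for "sites that answer within T", and the recursion models a higher-numbered responder taking over the election. The function and variable names are invented for the example.

    def bully_election(initiator, live_sites, all_sites):
        """Toy sketch of the bully algorithm's outcome (no real messaging or timers)."""
        higher = [s for s in all_sites if s > initiator]
        responders = [s for s in higher if s in live_sites]
        if not responders:
            # No higher-numbered site answered within T: the initiator elects itself.
            return initiator
        # Otherwise a higher-numbered responder runs the same logic; ultimately the
        # highest-numbered live site wins ("bullies" the rest).
        return bully_election(max(responders), live_sites, all_sites)

    # Sites 1..5; the coordinator (5) has failed and site 2 notices the timeout.
    print(bully_election(2, live_sites={1, 2, 3, 4}, all_sites=[1, 2, 3, 4, 5]))   # 4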

Page 127: Massively Parallel/Distributed Data Storage Systems


Background: Concurrency Control and Replication in Distributed Systems

Page 128: Massively Parallel/Distributed Data Storage Systems


Concurrency Control

- Modify concurrency control schemes for use in a distributed environment
- We assume that each site participates in the execution of a commit protocol to ensure global transaction atomicity
- We assume all replicas of any item are updated
  - Will see how to relax this in case of site failures later

Page 129: Massively Parallel/Distributed Data Storage Systems


Single-Lock-Manager Approach

- The system maintains a single lock manager that resides in a single chosen site, say Si
- When a transaction needs to lock a data item, it sends a lock request to Si, and the lock manager determines whether the lock can be granted immediately
  - If yes, the lock manager sends a message to the site which initiated the request
  - If no, the request is delayed until it can be granted, at which time a message is sent to the initiating site

Page 130: Massively Parallel/Distributed Data Storage Systems


Single-Lock-Manager Approach (Cont.)

- The transaction can read the data item from any one of the sites at which a replica of the data item resides
- Writes must be performed on all replicas of a data item
- Advantages of the scheme:
  - Simple implementation
  - Simple deadlock handling
- Disadvantages of the scheme:
  - Bottleneck: the lock manager site becomes a bottleneck
  - Vulnerability: the system is vulnerable to lock manager site failure
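
A minimal sketch of the single-lock-manager idea (exclusive locks only; the class and method names are invented for illustration): grant the lock immediately if the item is free, otherwise queue the request until the holder releases.

    from collections import defaultdict, deque

    class SingleLockManager:
        """Illustrative sketch of the lock manager running at the chosen site Si."""

        def __init__(self):
            self.holder = {}                     # data item -> transaction holding the lock
            self.waiting = defaultdict(deque)    # data item -> queued transactions

        def request_lock(self, txn, item):
            if item not in self.holder:
                self.holder[item] = txn
                return True                      # "lock granted" message to the requesting site
            self.waiting[item].append(txn)
            return False                         # request delayed until the lock is released

        def release_lock(self, txn, item):
            assert self.holder.get(item) == txn
            if self.waiting[item]:
                self.holder[item] = self.waiting[item].popleft()   # grant to the next waiter
                # (a message would now be sent to that transaction's site)
            else:
                del self.holder[item]

    lm = SingleLockManager()
    print(lm.request_lock("T1", "x"))   # True: granted
    print(lm.request_lock("T2", "x"))   # False: queued behind T1
    lm.release_lock("T1", "x")          # lock passes to T2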

Page 131: Massively Parallel/Distributed Data Storage Systems


Distributed Lock Manager

- In this approach, the locking functionality is implemented by lock managers at each site
  - Lock managers control access to local data items
  - But special protocols may be used for replicas
- Advantage: work is distributed and can be made robust to failures
- Disadvantage: deadlock detection is more complicated
  - Lock managers must cooperate for deadlock detection
  - More on this later
- Several variants of this approach:
  - Primary copy
  - Majority protocol
  - Biased protocol
  - Quorum consensus

Page 132: Massively Parallel/Distributed Data Storage Systems


Distributed Commit Protocols

Page 133: Massively Parallel/Distributed Data Storage Systems


Two Phase Commit Protocol (2PC)

- Assumes a fail-stop model: failed sites simply stop working, and do not cause any other harm, such as sending incorrect messages to other sites
- Execution of the protocol is initiated by the coordinator after the last step of the transaction has been reached
- The protocol involves all the local sites at which the transaction executed
- Let T be a transaction initiated at site Si, and let the transaction coordinator at Si be Ci

Page 134: Massively Parallel/Distributed Data Storage Systems


Phase 1: Obtaining a Decision

- The coordinator asks all participants to prepare to commit transaction T:
  - Ci adds the record <prepare T> to the log and forces the log to stable storage
  - Ci sends prepare T messages to all sites at which T executed
- Upon receiving the message, the transaction manager at each site determines if it can commit the transaction
  - If not: add a record <no T> to the log and send an abort T message to Ci
  - If the transaction can be committed, then:
    - add the record <ready T> to the log
    - force all records for T to stable storage
    - send a ready T message to Ci

Page 135: Massively Parallel/Distributed Data Storage Systems


Phase 2: Recording the Decision

- T can be committed if Ci received a ready T message from all the participating sites; otherwise T must be aborted
- The coordinator adds a decision record, <commit T> or <abort T>, to the log and forces the record onto stable storage
  - Once the record reaches stable storage, the decision is irrevocable (even if failures occur)
- The coordinator sends a message to each participant informing it of the decision (commit or abort)
- Participants take the appropriate action locally
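
The control flow of the two phases can be sketched as follows. This is illustrative only: the participant objects with prepare()/decide() methods, the ConnectionError standing in for timeouts, and the log interface are all assumptions, and a real implementation must also handle the recovery cases on the following slides.

    def two_phase_commit(coordinator_log, participants, T):
        """Toy sketch of the 2PC coordinator's control flow (assumed interfaces)."""
        # Phase 1: ask every participant to prepare.
        coordinator_log.write(f"<prepare {T}>")             # forced to stable storage
        votes = []
        for p in participants:
            try:
                votes.append(p.prepare(T))                  # participant logs <ready T> or <no T>
            except ConnectionError:
                votes.append("no")                          # no vote within the timeout => abort

        # Phase 2: commit only if every participant voted ready.
        outcome = "commit" if all(v == "ready" for v in votes) else "abort"
        coordinator_log.write(f"<{outcome} {T}>")           # the decision is now irrevocable
        for p in participants:
            try:
                p.decide(T, outcome)                        # participants redo/undo locally
            except ConnectionError:
                pass    # the participant learns the outcome from Ci when it recovers
        return outcome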

Page 136: Massively Parallel/Distributed Data Storage Systems


Handling of Failures: Site Failure

When site Sk recovers, it examines its log to determine the fate of transactions active at the time of the failure:
- Log contains a <commit T> record: the site executes redo(T)
- Log contains an <abort T> record: the site executes undo(T)
- Log contains a <ready T> record: the site must consult Ci to determine the fate of T
  - If T committed, redo(T)
  - If T aborted, undo(T)
- The log contains no control records concerning T: this implies that Sk failed before responding to the prepare T message from Ci
  - Since the failure of Sk precludes the sending of such a response, Ci must abort T
  - Sk must execute undo(T)

Page 137: Massively Parallel/Distributed Data Storage Systems


Handling of Failures: Coordinator Failure

If the coordinator fails while the commit protocol for T is executing, then the participating sites must decide on T's fate:
1. If an active site contains a <commit T> record in its log, then T must be committed.
2. If an active site contains an <abort T> record in its log, then T must be aborted.
3. If some active participating site does not contain a <ready T> record in its log, then the failed coordinator Ci cannot have decided to commit T. The sites can therefore abort T.
4. If none of the above cases holds, then all active sites must have a <ready T> record in their logs, but no additional control records (such as <abort T> or <commit T>). In this case the active sites must wait for Ci to recover to find the decision.

Blocking problem: active sites may have to wait for the failed coordinator to recover.

Page 138: Massively Parallel/Distributed Data Storage Systems


Handling of Failures: Network Partition

- If the coordinator and all its participants remain in one partition, the failure has no effect on the commit protocol.
- If the coordinator and its participants belong to several partitions:
  - Sites that are not in the partition containing the coordinator think the coordinator has failed, and execute the protocol to deal with coordinator failure.
    - No harm results, but sites may still have to wait for the decision from the coordinator.
  - The coordinator and the sites that are in the same partition as the coordinator think that the sites in the other partitions have failed, and follow the usual commit protocol.
    - Again, no harm results.

Page 139: Massively Parallel/Distributed Data Storage Systems


Recovery and Concurrency Control

- In-doubt transactions have a <ready T> log record, but neither a <commit T> nor an <abort T> record.
- The recovering site must determine the commit-abort status of such transactions by contacting other sites; this can be slow and can potentially block recovery.
- Recovery algorithms can note lock information in the log:
  - Instead of <ready T>, write out <ready T, L>, where L is the list of locks held by T when the log record is written (read locks can be omitted).
  - For every in-doubt transaction T, all the locks noted in the <ready T, L> log record are reacquired.
- After lock reacquisition, transaction processing can resume; the commit or rollback of in-doubt transactions is performed concurrently with the execution of new transactions.

Page 140: Massively Parallel/Distributed Data Storage Systems


Three Phase Commit (3PC)

- Assumptions:
  - No network partitioning
  - At any point, at least one site must be up
  - At most K sites (participants as well as coordinator) can fail
- Phase 1 (Obtaining a Preliminary Decision): identical to 2PC Phase 1
  - Every site is ready to commit if instructed to do so
- Phase 2 of 2PC is split into Phase 2 and Phase 3 of 3PC:
  - In Phase 2, the coordinator makes a decision as in 2PC (called the pre-commit decision) and records it at multiple (at least K) sites
  - In Phase 3, the coordinator sends a commit/abort message to all participating sites
- Under 3PC, knowledge of the pre-commit decision can be used to commit despite coordinator failure
  - Avoids the blocking problem as long as fewer than K sites fail
- Drawbacks:
  - Higher overheads
  - Assumptions may not be satisfied in practice

Page 141: Massively Parallel/Distributed Data Storage Systems


Timestamping

- Timestamp-based concurrency-control protocols can be used in distributed systems
- Each transaction must be given a unique timestamp
- Main problem: how to generate a timestamp in a distributed fashion
  - Each site generates a unique local timestamp using either a logical counter or the local clock
  - A globally unique timestamp is obtained by concatenating the unique local timestamp with the unique site identifier

Page 142: Massively Parallel/Distributed Data Storage Systems


Timestamping (Cont.)

- A site with a slow clock will assign smaller timestamps
  - Still logically correct: serializability is not affected
  - But: it "disadvantages" transactions from that site
- To fix this problem:
  - Define within each site Si a logical clock (LCi), which generates the unique local timestamp
  - Require that Si advance its logical clock whenever a request is received from a transaction Ti with timestamp <x,y> and x is greater than the current value of LCi
    - In this case, site Si advances its logical clock to the value x + 1
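
A small sketch of the scheme above (class and method names invented for illustration): globally unique timestamps are (local logical clock, site id) pairs compared lexicographically, and a site advances its logical clock when it observes a larger timestamp, so a slow site does not keep issuing smaller timestamps.

    class SiteClock:
        """Illustrative sketch of distributed timestamp generation at one site."""

        def __init__(self, site_id):
            self.site_id = site_id
            self.lc = 0                      # logical clock LCi

        def new_timestamp(self):
            self.lc += 1
            return (self.lc, self.site_id)   # concatenate local timestamp with site id

        def observe(self, timestamp):
            """On receiving a request with timestamp <x, y>, advance LCi to x + 1
            if x is greater than the current value of LCi."""
            x, _ = timestamp
            if x > self.lc:
                self.lc = x + 1

    s1, s2 = SiteClock(1), SiteClock(2)
    t = s2.new_timestamp()                   # (1, 2) issued at site 2
    s1.observe(t)                            # site 1's clock jumps to 2
    print(s1.new_timestamp())                # (3, 1) > (1, 2) lexicographically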

Page 143: Massively Parallel/Distributed Data Storage Systems


Comparison with Remote Backup

- Remote backup (hot spare) systems (Section 17.10) are also designed to provide high availability
- Remote backup systems are simpler and have lower overhead
  - All actions are performed at a single site, and only log records are shipped
  - No need for distributed concurrency control or two-phase commit
- Using distributed databases with replicas of data items can provide higher availability, by having multiple (> 2) replicas and using the majority protocol
  - This also avoids the failure detection and switchover time associated with remote backup systems