hyder a transactional record manager for shared flash · snapshot 3. append intention log record...

Hyder – A Transactional Record Manager for Shared Flash

Philip A. Bernstein, Microsoft Corporation

Colin Reid, Microsoft Corporation

Sudipto Das, UC Santa Barbara

© 2011 Microsoft Corporation

CIDR 2011 January 10, 2011

Hyder: The Big Picture

• Goal: Enable scale-out without partitioning DB or app

2

• Store the whole DB in flash – which is accessible to all servers – via a fast data center network

• Main architectural features – Uses a log-structured DB in flash – Broadcast log to all servers – Roll forward log on all servers – Optimistic concurrency control

Network

Internet

Hyder Log

Server

Hyder App

Server

Hyder App

Server

Hyder App

• There’s no cross-talk between servers

– Hence, Hyder scales-out without partitioning

What is Hyder?

3

An incubation, i.e. research project.

A software stack for transactional record management

• Stores [key, value] pairs, which are accessed within transactions

Functionality • Record operations:

– Insert, Delete, Update, Get where field = X; Get next

• Transactions: Start, Commit, Abort

Why build another one? • Exploit flash memory and high-speed networks

to simplify scaling out large-scale web services

Network

Scaling Out with Partitioning

4

Internet

Database Partition

App

$

Log

Data

Web Server

App $

$

• Database is partitioned across multiple servers

• Each query is sent to the appropriate partition(s)

• For scalability, avoid distributed transactions

• Cross partition consistency is enforced in the application

• Hard to provision servers and distribute load evenly

$ $ $ $

Web Server

App $

Web Server

App $

Database Partition

App

$

Log

Data

Database Partition

App

$

Log

Data

Database Partition

App

$

Log

Data

Network

Hyder Scales Out Without Partitioning

5

Internet • In Hyder, the log is the database

• All servers can access the log

• No partitioning is required

• Database is multi-versioned, so server caches are trivially coherent

• Hence, can parallelize a query with consistency across servers

• And servers can fetch pages from the log or from neighboring servers’ caches

Hyder Log

Web Server

Hyder $

App

Web Server

Hyder $

App

Web Server

Hyder $

App

Hyder Runs in the Application Process

6

• No distributed programming

• No distributed caches for the app to keep consistent

• Avoids the expense of RPC’s to a database server

• Simple high performance programming model Network

Internet

Hyder Log

Web Server

Hyder $

App

Web Server

Hyder $

App

Web Server

Hyder $

App

Enabling Hardware Assumptions

• Flash offers cheap and abundant I/O operations

Can spread the DB across a log, with less physical contiguity

• Cheap high-performance data center networks Many servers can share storage, with high performance

• Large, cheap, 64-bit addressable memories Reduces the rate that Hyder needs to access the log

• Many-core web servers

Hyder can afford to roll forward the log on all servers

7

The Hyder Stack

• Segments, stripes and streams

8

• Optimistic transaction protocol

• ISAM, SQL, LINQ, etc.

• Append-only custom controller interface

• Multi-versioned search tree

Network

Scalable Reliable Storage Layer

Indexed Record Layer

Transaction Layer

API

Database is a Search Tree

9

G

B

A

H

C I

D

A D C B I H G

Binary Search Tree

Tree is marshaled into the log

In this paper, it’s a binary search tree.

D

Update D’s value

Binary Tree is Multi-versioned

10

G

B

A

H

C I

D

• Copy on write

• To update a node, replace nodes up to the root

C

G

B

Transaction Execution • Each server has a cache of the last committed database state

11

A D C B I H G

DB cache

G

B

C

D

H

I A

last committed

Transaction execution 1. Get pointer to snapshot G

B

C

D

2. Generate updates locally

D C B G

Snapshot

3. Append intention log record

• A transaction reads a snapshot and writes an intention log record

Log Updates are Broadcast

12

Broadcast intention

Broadcast ack

Transaction Commit • Each server rolls forward transactions in log sequence

• When it processes an intention log record,

– it checks whether the transaction experienced a conflict

– if not, the transaction committed and the server merges the intention into its last committed state

• All servers make the same commit/abort decisions

13

A D C B I H G

T’s conflict zone

Did a committed transaction write into T’s readset or writeset here?

transaction T

D C B G Snapshot

Performance Bottleneck Analysis

• There are 4 bottlenecks in the update pipeline

1. 100K log-appends/second, assuming 20-way parallel flash storage

2. Broadcast 67K update transactions/second over 10 Gb Ethernet

3. Meld can do up to 400K update transactions/second

4. Opt CC: Abort rate depends on conflict probability and txn latency – Suppose transaction latency is 200 μs

– If all txns conflict, best case SR execution is serial ==> 5000 TPS

– With random arrivals ==> ~ 1600 update TPS

14

Throughput with High Data Contention

15

8

24

40

56

72

10 20 30 40 50 60 70 80

Thro

ugh

pu

t (K

tp

s)

Offered Load (K tps)

• 8 reads, 2 writes per transaxn

• 99% of ops access 1% of data

• Serializable isolation

• Assume the network, log, and meld can perform 100K intentions/sec

DB-Size-10K

TxnSize-20, DB-100K

R:W=1:1, DB-100K

DB-Size-50K

DB-Size-100K

Thrashing due to Resource Contention • Thrashing occurs when exceeding the maximum resource

throughput of 100K/second

16

0

100

200

300

400

500

0

20

40

60

80

100

95 97 99 101 103 105

Tran

sact

ion

Lat

en

cy (

ms)

Tran

sact

ion

Th

rou

ghp

ut

(K

tp

s)

Offered Load (K tps)

Hot95-5

Hot80-20

Uniform

Latency

Major Technologies • Flash is append-only. Custom controller has

mechanisms for synchronization & fault tolerance

• Storage is striped, with a self-adaptive algorithm for storage allocation and load balancing

• Fault-tolerant protocol for a totally ordered log

• Fast meld algorithm to detect conflicts and merge intention records into last-committed state

17

Summary of Contributions • A new data-sharing architecture for scaling out without

partitioning.

• A fault-tolerant append-only log that arbitrates concurrent appends by independent servers.

• A log-structured multiversion binary-search-tree index.

• An efficient meld algorithm to detect conflicts & merge committed updates into the last-committed state.

• A simulation analysis of the Hyder architecture under a variety of workloads and system configurations.

18

Errata for the paper

• In the 5th paragraph of Section 2.3 on sliding window striping, “AppendStripe” should be “AppendPage”.

• Also, the following paper should have been included as related work:

– Radu Stoica, Manos Athanassoulis, Ryan Johnson, Anastasia Ailamaki: Evaluating and repairing write performance on flash devices. DaMoN 2009: 9-14

19

hyder a transactional record manager for shared flash · snapshot 3. append intention log record...

Documents