hyder a transactional record manager for shared flash · snapshot 3. append intention log record...
TRANSCRIPT
Hyder – A Transactional Record Manager for Shared Flash
Philip A. Bernstein, Microsoft Corporation
Colin Reid, Microsoft Corporation
Sudipto Das, UC Santa Barbara
© 2011 Microsoft Corporation
CIDR 2011 January 10, 2011
Hyder: The Big Picture
• Goal: Enable scale-out without partitioning DB or app
2
• Store the whole DB in flash – which is accessible to all servers – via a fast data center network
• Main architectural features – Uses a log-structured DB in flash – Broadcast log to all servers – Roll forward log on all servers – Optimistic concurrency control
Network
Internet
Hyder Log
Server
Hyder App
Server
Hyder App
Server
Hyder App
• There’s no cross-talk between servers
– Hence, Hyder scales-out without partitioning
What is Hyder?
3
An incubation, i.e. research project.
A software stack for transactional record management
• Stores [key, value] pairs, which are accessed within transactions
Functionality • Record operations:
– Insert, Delete, Update, Get where field = X; Get next
• Transactions: Start, Commit, Abort
Why build another one? • Exploit flash memory and high-speed networks
to simplify scaling out large-scale web services
Network
Scaling Out with Partitioning
4
Internet
Database Partition
App
$
Log
Data
Web Server
App $
$
• Database is partitioned across multiple servers
• Each query is sent to the appropriate partition(s)
• For scalability, avoid distributed transactions
• Cross partition consistency is enforced in the application
• Hard to provision servers and distribute load evenly
$ $ $ $
Web Server
App $
Web Server
App $
Database Partition
App
$
Log
Data
Database Partition
App
$
Log
Data
Database Partition
App
$
Log
Data
Network
Hyder Scales Out Without Partitioning
5
Internet • In Hyder, the log is the database
• All servers can access the log
• No partitioning is required
• Database is multi-versioned, so server caches are trivially coherent
• Hence, can parallelize a query with consistency across servers
• And servers can fetch pages from the log or from neighboring servers’ caches
Hyder Log
Web Server
Hyder $
App
Web Server
Hyder $
App
Web Server
Hyder $
App
Hyder Runs in the Application Process
6
• No distributed programming
• No distributed caches for the app to keep consistent
• Avoids the expense of RPC’s to a database server
• Simple high performance programming model Network
Internet
Hyder Log
Web Server
Hyder $
App
Web Server
Hyder $
App
Web Server
Hyder $
App
Enabling Hardware Assumptions
• Flash offers cheap and abundant I/O operations
Can spread the DB across a log, with less physical contiguity
• Cheap high-performance data center networks Many servers can share storage, with high performance
• Large, cheap, 64-bit addressable memories Reduces the rate that Hyder needs to access the log
• Many-core web servers
Hyder can afford to roll forward the log on all servers
7
The Hyder Stack
• Segments, stripes and streams
8
• Optimistic transaction protocol
• ISAM, SQL, LINQ, etc.
• Append-only custom controller interface
• Multi-versioned search tree
Network
Scalable Reliable Storage Layer
Indexed Record Layer
Transaction Layer
API
Database is a Search Tree
9
G
B
A
H
C I
D
A D C B I H G
Binary Search Tree
Tree is marshaled into the log
In this paper, it’s a binary search tree.
D
Update D’s value
Binary Tree is Multi-versioned
10
G
B
A
H
C I
D
• Copy on write
• To update a node, replace nodes up to the root
C
G
B
Transaction Execution • Each server has a cache of the last committed database state
11
A D C B I H G
DB cache
G
B
C
D
H
I A
last committed
Transaction execution 1. Get pointer to snapshot G
B
C
D
2. Generate updates locally
D C B G
Snapshot
3. Append intention log record
• A transaction reads a snapshot and writes an intention log record
Log Updates are Broadcast
12
Broadcast intention
Broadcast ack
Transaction Commit • Each server rolls forward transactions in log sequence
• When it processes an intention log record,
– it checks whether the transaction experienced a conflict
– if not, the transaction committed and the server merges the intention into its last committed state
• All servers make the same commit/abort decisions
13
A D C B I H G
T’s conflict zone
Did a committed transaction write into T’s readset or writeset here?
transaction T
D C B G Snapshot
Performance Bottleneck Analysis
• There are 4 bottlenecks in the update pipeline
1. 100K log-appends/second, assuming 20-way parallel flash storage
2. Broadcast 67K update transactions/second over 10 Gb Ethernet
3. Meld can do up to 400K update transactions/second
4. Opt CC: Abort rate depends on conflict probability and txn latency – Suppose transaction latency is 200 μs
– If all txns conflict, best case SR execution is serial ==> 5000 TPS
– With random arrivals ==> ~ 1600 update TPS
14
Throughput with High Data Contention
15
8
24
40
56
72
10 20 30 40 50 60 70 80
Thro
ugh
pu
t (K
tp
s)
Offered Load (K tps)
• 8 reads, 2 writes per transaxn
• 99% of ops access 1% of data
• Serializable isolation
• Assume the network, log, and meld can perform 100K intentions/sec
DB-Size-10K
TxnSize-20, DB-100K
R:W=1:1, DB-100K
DB-Size-50K
DB-Size-100K
Thrashing due to Resource Contention • Thrashing occurs when exceeding the maximum resource
throughput of 100K/second
16
0
100
200
300
400
500
0
20
40
60
80
100
95 97 99 101 103 105
Tran
sact
ion
Lat
en
cy (
ms)
Tran
sact
ion
Th
rou
ghp
ut
(K
tp
s)
Offered Load (K tps)
Hot95-5
Hot80-20
Uniform
Latency
Major Technologies • Flash is append-only. Custom controller has
mechanisms for synchronization & fault tolerance
• Storage is striped, with a self-adaptive algorithm for storage allocation and load balancing
• Fault-tolerant protocol for a totally ordered log
• Fast meld algorithm to detect conflicts and merge intention records into last-committed state
17
Summary of Contributions • A new data-sharing architecture for scaling out without
partitioning.
• A fault-tolerant append-only log that arbitrates concurrent appends by independent servers.
• A log-structured multiversion binary-search-tree index.
• An efficient meld algorithm to detect conflicts & merge committed updates into the last-committed state.
• A simulation analysis of the Hyder architecture under a variety of workloads and system configurations.
18
Errata for the paper
• In the 5th paragraph of Section 2.3 on sliding window striping, “AppendStripe” should be “AppendPage”.
• Also, the following paper should have been included as related work:
– Radu Stoica, Manos Athanassoulis, Ryan Johnson, Anastasia Ailamaki: Evaluating and repairing write performance on flash devices. DaMoN 2009: 9-14
19