20110515 google megastore

Megastore - Providing Scalable, Highly Available

J. Baker, C. Bond, J.C. Corbett, JJ Furman, A. Khorlin, J. Larson, J-M Léon, Y. Li, A. Lloyd, V. Yushprakh

Google Inc.

[email protected]

May 2011

Agenda

Motivation

Architecture

ACID over a NoSQL Database

Replication via Paxos

Operational Results

Motivation

Build a system to please everyone (users, admins, developers).

Motivation

High availability – Fully functional during planned maintenance periods, as well as most unplanned infrastructure issues.

Scalability – Serve a huge audience of potential users.

ACID – Makes applications easier to write and deploy.

Megastore Overview

Widely deployed in Google for several years.

Used by more than 100 production applications.

Handles more than 3 billion write and 20 billion read transactions daily.

Stores nearly a petabyte of primary data across many global datacenters.

Available on Google App Engine (GAE) since January 2011.

Architecture

Built on top of Bigtable and Chubby.

Blends the scalability of a NoSQL datastore with the convenience of a traditional RDBMS.

Synchronous replication based on Paxos across datacenters.

Architecture

Architecture

Scalable replication.

Architecture

Operation across Entity Groups

Data Model

Somewhere between an RDBMS and the row-column storage of NoSQL.

Schemas

Tables (entity group root table or child table; a child table must have a single distinguished foreign key referencing the root table)

Entities

Properties

Sample Schema

Mapping to Bigtable

Primary Keys are chosen to cluster entities that will be read together.

Each entity is mapped into a single Bigtable row.

The “IN TABLE” directive instructs Megastore to colocate tables in the same Bigtable, and key ordering ensures that Photo entities are stored adjacent to the corresponding User.

Bigtable column name = Megastore table name + property name
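The mapping on this slide can be illustrated with a minimal Python sketch. It assumes row keys are tuples of primary-key values, so that sorting them stands in for Bigtable's key ordering, and that the table name and property name are joined with a dot. The helpers `row_key` and `to_columns` and the sample User/Photo values are hypothetical and are not Megastore's actual encoding.

```python
def row_key(*primary_key_parts):
    """Build a sortable row key from primary-key values (assumed tuple encoding)."""
    return tuple(primary_key_parts)


def to_columns(table, properties):
    """Name each column '<table>.<property>' (the '.' separator is an assumption)."""
    return {f"{table}.{name}": value for name, value in properties.items()}


# A plain dict stands in for one Bigtable; each entity occupies exactly one row.
bigtable = {
    row_key(102): to_columns("User", {"name": "Mary"}),
    row_key(101, 502): to_columns("Photo", {"tag": "Dinner"}),
    row_key(101): to_columns("User", {"name": "John"}),
    row_key(101, 500): to_columns("Photo", {"tag": "Vacation"}),
}

# Sorting by key places each user's Photo rows directly after that User row.
ordered_keys = sorted(bigtable)
for key in ordered_keys:
    print(key, bigtable[key])
```

Because a Photo key starts with the key of its User, a range scan over one user's key prefix would return the user and all of that user's photos together.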

Indexes

Two levels of indexes:

Local index: a separate index for each entity group. Stored in the entity group and updated atomically and consistently.

Global index: spans entity groups. Not guaranteed to reflect all recent updates.

Transactions & Concurrency

Each entity group is a mini-database providing serializable ACID semantics.

MVCC (multiversion concurrency control) using transaction timestamps.

Reads and writes are isolated.
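To make the MVCC idea concrete, here is a minimal Python sketch of a single cell that keeps every version under the timestamp of the transaction that wrote it. The class `VersionedCell` and its methods are hypothetical names for illustration; the sketch assumes integer timestamps and does not model Bigtable's real storage.

```python
import bisect


class VersionedCell:
    """One cell holding multiple values, each tagged with a write timestamp."""

    def __init__(self):
        self._timestamps = []  # kept sorted in ascending order
        self._values = []      # _values[i] was written at _timestamps[i]

    def write(self, timestamp, value):
        # Insert in timestamp order; on equal timestamps the later write wins.
        index = bisect.bisect_right(self._timestamps, timestamp)
        self._timestamps.insert(index, timestamp)
        self._values.insert(index, value)

    def read(self, timestamp):
        # Return the newest value written at or before `timestamp`.
        index = bisect.bisect_right(self._timestamps, timestamp)
        return self._values[index - 1] if index else None


cell = VersionedCell()
cell.write(10, "draft")
cell.write(20, "final")
print(cell.read(15))  # reader at timestamp 15 still sees the older version
print(cell.read(25))  # reader at timestamp 25 sees the newer version
```

A reader pinned to a timestamp is unaffected by later writes, which is one way to see why reads and writes do not block each other.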

Transactions & Concurrency

Three levels of read consistency:

Current: apply all previously committed log entries before reading, within a single entity group.

Snapshot: read at the last known fully applied transaction, within a single entity group.

Inconsistent: ignore the state of the log and read the latest values directly.
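The difference between the three read levels is easiest to see on a replica that has a committed but only partly applied log entry. The following Python sketch is a toy model under that assumption; `EntityGroupReplica`, `apply_one_mutation`, and the dict-based log are hypothetical and only illustrate the behavior described above.

```python
class EntityGroupReplica:
    def __init__(self):
        self.log = []       # committed log entries; index = log position
        self.applied = 0    # number of log entries fully applied
        self.versions = {}  # key -> list of (log_position, value)

    def commit(self, mutations):
        self.log.append(dict(mutations))

    def _write(self, position, key, value):
        self.versions.setdefault(key, []).append((position, value))

    def apply_one_mutation(self, key):
        """Simulate a half-applied entry: write one key, leave `applied` as is."""
        self._write(self.applied, key, self.log[self.applied][key])

    def _apply_all(self):
        while self.applied < len(self.log):
            for key, value in self.log[self.applied].items():
                self._write(self.applied, key, value)
            self.applied += 1

    def _read_at(self, key, last_position):
        visible = [v for p, v in self.versions.get(key, []) if p <= last_position]
        return visible[-1] if visible else None

    def current_read(self, key):
        self._apply_all()  # catch up on every committed entry first
        return self._read_at(key, self.applied - 1)

    def snapshot_read(self, key):
        return self._read_at(key, self.applied - 1)  # last fully applied entry

    def inconsistent_read(self, key):
        history = self.versions.get(key, [])
        return history[-1][1] if history else None  # whatever is there now


replica = EntityGroupReplica()
replica.commit({"a": 1, "b": 1})
replica.current_read("a")          # applies the first entry
replica.commit({"a": 2, "b": 2})
replica.apply_one_mutation("a")    # second entry is only half applied
print(replica.snapshot_read("a"), replica.inconsistent_read("a"))
```

In this state a snapshot read still returns the old value of `a`, an inconsistent read already returns the new one (while `b` is still old), and a current read first applies the rest of the log.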

Transactions & Concurrency

Write transaction:

Current read: Obtain the timestamp and log position of the last committed transaction.

Application logic: Read from Bigtable and gather writes into a log entry.

Commit: Use Paxos to achieve consensus on appending the log entry to the log.

Apply: Write mutations to the entities and indexes in Bigtable.

Clean up: Delete temporary data.
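The write lifecycle can be sketched as a small single-process Python model. This is an illustration only: the `commit` method is a stand-in that lets exactly one entry win each log position, which is the outcome Paxos is used to agree on, and it does not implement Paxos itself. The names `EntityGroupLog`, `ConflictError`, and `run_transaction` are hypothetical, and the clean-up step is omitted.

```python
class ConflictError(Exception):
    """Raised when another writer already claimed the log position."""


class EntityGroupLog:
    def __init__(self):
        self.log = []     # write-ahead log, one entry per transaction
        self.applied = 0  # number of log entries applied to `data`
        self.data = {}    # stands in for the entity rows in Bigtable

    def current_read(self):
        """Current read: catch up, then return the next log position and a snapshot."""
        self.apply()
        return len(self.log), dict(self.data)

    def commit(self, position, entry):
        """Commit: stand-in for Paxos; only one entry can win each position."""
        if position != len(self.log):
            raise ConflictError(f"log position {position} already taken")
        self.log.append(entry)

    def apply(self):
        """Apply: write the mutations of committed entries to the data."""
        while self.applied < len(self.log):
            self.data.update(self.log[self.applied])
            self.applied += 1


def run_transaction(group, logic):
    position, snapshot = group.current_read()
    entry = logic(snapshot)  # application logic gathers writes into a log entry
    group.commit(position, entry)
    group.apply()
    return position


group = EntityGroupLog()
run_transaction(group, lambda snap: {"count": snap.get("count", 0) + 1})
stale_position, _ = group.current_read()
run_transaction(group, lambda snap: {"count": snap["count"] + 1})
try:
    group.commit(stale_position, {"count": 99})  # lost the race for this position
except ConflictError:
    print("conflict: retry from the current read")
```

A writer that loses the race for a log position has to start again from the current read, which is the optimistic concurrency behavior this model is meant to show.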

Transactions & Concurrency

Queues provide transactional messaging between entity groups. Declaring a queue automatically creates an inbox on each entity group, so queues scale automatically.

Two-phase commit

Queues are recommended over two-phase commit.
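As a rough illustration of queue-based messaging between entity groups, the Python sketch below lets a sender update its own state and emit a message in one step, with the receiver consuming the message later in its own local step. The names `QueueingGroup`, `transact`, `receive_one`, and `deliver` are hypothetical, and failure handling and asynchronous delivery are not modeled.

```python
from collections import deque


class QueueingGroup:
    def __init__(self, name):
        self.name = name
        self.state = {}
        self.inbox = deque()  # in this sketch every group gets an inbox

    def transact(self, updates, outgoing=()):
        """Apply local updates and return the messages sent by the same step."""
        self.state.update(updates)
        return list(outgoing)

    def receive_one(self, handler):
        """Consume one inbox message and apply its effect locally."""
        if not self.inbox:
            return False
        message = self.inbox.popleft()
        self.state.update(handler(self.state, message))
        return True


def deliver(groups, messages):
    """Move (target_name, payload) pairs into the target groups' inboxes."""
    for target, payload in messages:
        groups[target].inbox.append(payload)


groups = {name: QueueingGroup(name) for name in ("alice", "bob")}
sent = groups["alice"].transact(
    {"invites_sent": 1},
    outgoing=[("bob", {"invite_from": "alice"})],
)
deliver(groups, sent)
groups["bob"].receive_one(
    lambda state, msg: {"pending": state.get("pending", []) + [msg["invite_from"]]}
)
print(groups["bob"].state)
```

Each group only ever changes its own state, so no transaction spans two groups; that is the property that makes queues a lighter alternative to two-phase commit.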

Paxos

Basic Paxos

Multi-Paxos

Reads

Writes

Failure Detection

Coordinators obtain specific Chubby locks in remote datacenters at startup.

If a coordinator ever loses a majority of its locks because of a crash or network partition, it will consider all entity groups in its purview to be out of date.

Until the locks are regained and the coordinator's entries are revalidated, reads at the replica must query the log position from a majority of replicas.

All writers must wait for the coordinator's Chubby locks to expire before writes can complete.
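The coordinator's reaction to lost locks can be sketched in Python as follows. This is a toy model under stated assumptions: locks are plain booleans rather than real Chubby sessions, and the `Coordinator` class with its `lose_lock`, `regain_lock`, `validate`, and `can_read_locally` methods is hypothetical.

```python
class Coordinator:
    def __init__(self, lock_names):
        self.locks = {name: True for name in lock_names}  # obtained at startup
        self.up_to_date = set()  # entity groups known to be current here

    def has_majority(self):
        held = sum(self.locks.values())
        return held > len(self.locks) // 2

    def lose_lock(self, name):
        self.locks[name] = False
        if not self.has_majority():
            # Without a majority, treat every entity group as out of date.
            self.up_to_date.clear()

    def regain_lock(self, name):
        self.locks[name] = True

    def validate(self, entity_group):
        """Mark a group as current; only allowed while a majority is held."""
        if self.has_majority():
            self.up_to_date.add(entity_group)

    def can_read_locally(self, entity_group):
        # Otherwise the reader must ask a majority of replicas for the log position.
        return self.has_majority() and entity_group in self.up_to_date


coordinator = Coordinator(["dc-a", "dc-b", "dc-c"])
coordinator.validate("group-1")
coordinator.lose_lock("dc-a")
coordinator.lose_lock("dc-b")
print(coordinator.can_read_locally("group-1"))
```

Regaining the locks is not enough on its own in this model: local reads resume only after the entity group has been revalidated.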

Distribution of Availability

Distribution of Average Latencies

Conclusion

Most users see five nines availability

Average read latencies are tens of milliseconds, indicating most reads are local.

Most writes cost 100-400 milliseconds.
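For a sense of scale, five nines (99.999%) availability corresponds to roughly five minutes of unavailability per year. The short Python calculation below shows the arithmetic; it assumes a 365-day year, and `downtime_minutes_per_year` is a hypothetical helper, not a figure taken from the slides.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumes a 365-day year


def downtime_minutes_per_year(availability):
    """Convert an availability fraction into expected downtime per year."""
    return (1.0 - availability) * MINUTES_PER_YEAR


five_nines = downtime_minutes_per_year(0.99999)
print(f"{five_nines:.2f} minutes per year")
```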

Questions?