Google Spanner - Synchronously-Replicated, Globally-Distributed, Multi-Version Database
Internet-scale Distributed Systems
Google Spanner: a Synchronously-Replicated, Globally-Distributed, Multi-Version Database
22.01.2013 Maciej Jozwiak Page 1
Presented by: Maciej Jozwiak
Agenda
• Problem description
• Overview of available solutions
• Globally-distributed database
• Architecture
• How is data replicated?
• Data model
• TrueTime API
• Transactions
• Summary
Problem – Need for Scalable MySQL
• Google's advertising backend
– Based on MySQL
• Relations
• Query language
– Manually sharded
• Resharding is very costly
– Global distribution
SHARDING:
Sharding is another name for "horizontal partitioning" of a database. The rows of a database table are split into separate partitions (shards), each of which can be located on a separate database server or in a separate physical location.
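To make the cost of manual resharding concrete, here is a minimal sketch of hash-based sharding. It is only an illustration: the function names (`shard_for`, `partition`, `moved_fraction`), the `user_id` field, and the choice of hash-modulo placement are assumptions for this example and are not taken from Google's MySQL setup.

```python
import hashlib
from typing import Dict, List


def shard_for(key: str, num_shards: int) -> int:
    """Map a row key to a shard index with a stable hash (hypothetical scheme)."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


def partition(rows: List[Dict], key_field: str, num_shards: int) -> List[List[Dict]]:
    """Horizontally partition rows: each row lands in exactly one shard."""
    shards: List[List[Dict]] = [[] for _ in range(num_shards)]
    for row in rows:
        shards[shard_for(str(row[key_field]), num_shards)].append(row)
    return shards


def moved_fraction(keys: List[str], old: int, new: int) -> float:
    """Fraction of keys whose shard changes when the shard count changes."""
    moved = sum(shard_for(k, old) != shard_for(k, new) for k in keys)
    return moved / len(keys)


rows = [{"user_id": i, "name": f"user{i}"} for i in range(1000)]
shards = partition(rows, "user_id", 4)
keys = [str(r["user_id"]) for r in rows]
print([len(s) for s in shards])   # rows per shard
print(moved_fraction(keys, 4, 5))  # share of rows that must move when adding a shard
```

With plain hash-modulo placement, going from 4 to 5 shards relocates most rows, which is one reason resharding a manually sharded database is expensive.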
Overview of Available Solutions
Google Megastore
• Replicated ACID transactions
• Schematized semi-relational tables
• Synchronous replication support across data-centers
• Performance
• Lack of query language
Google Bigtable
• Scalability
• Throughput
• Performance
• Eventually-consistent replication support across data-centers
Bridging the Gap Between Megastore and Bigtable
Solution: Google Spanner
• Removes the need to manually partition data
• Synchronous replication and automatic failover
• Strong transactional semantics
• SQL-based query language
• Semi-relational, schematized tables
Globally-Distributed Database
Future scale:
• one million to 10 million servers
• 100s to 1000s of locations around the world
• 10^13 directories
• 10^18 bytes of storage
Cross-datacenter replicated data management:
• high availability
• minimize latency of data reads and writes
• replication configuration dynamically controlled at a fine grain by applications
Spanner Deployment - Universe
• Universe master (status + interactive debugging)
• Placement driver (moves data across zones automatically)
How Is Data Replicated?
Paxos: a family of protocols for solving consensus in a network of unreliable processors. Consensus is the process of agreeing on one result among a group of participants. This problem becomes difficult when the participants or their communication medium may experience failures.
Spanserver software stack
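The consensus idea behind Paxos can be sketched as a toy, single-process simulation of single-decree Paxos. This is not Spanner's implementation (the slides do not show it); the `Acceptor` class, the `propose` function, and the `alive` flag used to model failures are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Acceptor:
    promised: int = -1                           # highest proposal number promised
    accepted: Optional[Tuple[int, str]] = None   # (number, value) last accepted
    alive: bool = True                           # False models a failed replica

    def prepare(self, n: int):
        """Phase 1: promise to ignore proposals numbered below n."""
        if not self.alive or n <= self.promised:
            return None
        self.promised = n
        return ("promise", self.accepted)

    def accept(self, n: int, value: str) -> bool:
        """Phase 2: accept unless a higher-numbered promise was made."""
        if not self.alive or n < self.promised:
            return False
        self.promised = n
        self.accepted = (n, value)
        return True


def propose(acceptors: List[Acceptor], n: int, value: str) -> Optional[str]:
    """Run one proposal round; return the chosen value, or None without a majority."""
    majority = len(acceptors) // 2 + 1
    promises = [p for p in (a.prepare(n) for a in acceptors) if p is not None]
    if len(promises) < majority:
        return None
    # If any acceptor already accepted a value, the proposer must adopt the
    # value with the highest proposal number instead of its own.
    prior = [acc for _, acc in promises if acc is not None]
    if prior:
        value = max(prior, key=lambda acc: acc[0])[1]
    acks = sum(a.accept(n, value) for a in acceptors)
    return value if acks >= majority else None


acceptors = [Acceptor() for _ in range(5)]
acceptors[0].alive = False
acceptors[1].alive = False
first = propose(acceptors, 1, "A")    # 3 of 5 alive: still a majority
second = propose(acceptors, 2, "B")   # later proposer must keep the chosen value
acceptors[2].alive = False
third = propose(acceptors, 3, "C")    # only 2 of 5 alive: no progress
print(first, second, third)
```

The second round shows the safety property: once a value is chosen, a later proposer with a different value ends up re-proposing the chosen one. The third round shows that progress requires a live majority.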
Replication Configuration
• Replication configurations for data can be dynamically controlled at a fine grain by applications
• Applications can specify constraints to control:
– which datacenters contain which data
– how far data is from the user (to control read latency)
– how far replicas are from each other (to control write latency)
– how many replicas are maintained (to control durability, availability, and read performance)
• Example: North America 5 replicas, Europe 2 replicas
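The trade-offs in the list above can be illustrated with a small sketch of a placement policy. The `PlacementPolicy` class and its fields are hypothetical, and the sketch assumes that all replicas vote in a single Paxos group that needs a simple majority; the slides do not specify how the example configuration is split into groups.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PlacementPolicy:
    """Hypothetical per-region replica counts for one piece of data."""
    replicas_per_region: Dict[str, int]

    @property
    def total_replicas(self) -> int:
        return sum(self.replicas_per_region.values())

    @property
    def write_quorum(self) -> int:
        # Assumption: a write needs a simple majority of all replicas.
        return self.total_replicas // 2 + 1

    @property
    def tolerated_failures(self) -> int:
        return self.total_replicas - self.write_quorum

    def survives_loss_of(self, region: str) -> bool:
        """Can writes still reach a quorum if a whole region is unavailable?"""
        remaining = self.total_replicas - self.replicas_per_region[region]
        return remaining >= self.write_quorum


policy = PlacementPolicy({"North America": 5, "Europe": 2})
print(policy.total_replicas, policy.write_quorum, policy.tolerated_failures)
print(policy.survives_loss_of("Europe"), policy.survives_loss_of("North America"))
```

Under these assumptions, more replicas raise durability and read capacity, while the quorum size and the distance between replicas determine write latency.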
Hierarchical Data Model
• Universe (Spanner deployment)
– Database
• Tables
– Rows and columns
– Must have an ordered set of one or more primary key columns
– Primary key uniquely identifies each row
• Hierarchies of tables
– Tables must be partitioned by the client into one or more hierarchies of tables (INTERLEAVE IN)
– The table at the top of a hierarchy is the directory table
Storing Photo Metadata
Albums(2,1) – the row from the Albums table for user_id 2, album_id 1
Interleaving is important because it allows clients to describe the locality relationships between tables, which is necessary for good performance in a sharded, distributed database.
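A rough way to picture interleaving is to sort all rows of a table hierarchy by primary key, so that each album row sits directly after the user row it belongs to. The sketch below is only an illustration of that key ordering; the `Users` table name, the sample rows, and the helper functions `interleave` and `directory` are assumptions, not Spanner's storage format.

```python
from typing import Dict, List, Tuple

Row = Tuple[str, Tuple[int, ...]]  # (table name, primary key)

rows: List[Row] = [
    ("Albums", (2, 1)),
    ("Users", (1,)),
    ("Albums", (1, 2)),
    ("Users", (2,)),
    ("Albums", (1, 1)),
    ("Albums", (2, 2)),
]


def interleave(rows: List[Row]) -> List[Row]:
    """Order rows by primary key; a parent key (1,) sorts before its children (1, x)."""
    return sorted(rows, key=lambda row: row[1])


def directory(rows: List[Row]) -> Dict[int, List[str]]:
    """Group interleaved rows by the directory-table key (first key component)."""
    groups: Dict[int, List[str]] = {}
    for table, key in interleave(rows):
        label = f"{table}({','.join(map(str, key))})"
        groups.setdefault(key[0], []).append(label)
    return groups


layout = [f"{t}({','.join(map(str, k))})" for t, k in interleave(rows)]
print(layout)
print(directory(rows))
```

Each group returned by `directory` corresponds to one user together with that user's albums, which is the unit that can be kept together on one server.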
Key Innovation
Spanner knows what time it is
Is Synchronizing Time at the Global Scale Possible?
Distributed systems dogma:
• synchronizing time within and between datacenters is extremely hard and uncertain
• serialization of requests is impossible at global scale
Idea: accept uncertainty, keep it small, and quantify it (using GPS and atomic clocks)
TrueTime API
Novel API distributing a globally synchronized "proper time"
Method        Returns
TT.now()      TTinterval: [earliest, latest]
TT.after(t)   True if t has definitely passed
TT.before(t)  True if t has definitely not arrived
The TTinterval is guaranteed to contain the absolute time at which TT.now() was invoked.
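The three methods can be mimicked in a few lines of Python. This is a sketch under stated assumptions, not Google's API: the class names, the injectable `clock` callable, and the fixed `epsilon` uncertainty bound are all hypothetical, and a real deployment derives the bound from GPS and atomic clock references rather than a constant.

```python
import time
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class TrueTime:
    """Toy TrueTime: local clock reading plus or minus an assumed error bound."""

    def __init__(self, clock: Callable[[], float] = time.time, epsilon: float = 0.004):
        self._clock = clock        # source of local time in seconds
        self._epsilon = epsilon    # assumed uncertainty bound in seconds

    def now(self) -> TTInterval:
        t = self._clock()
        return TTInterval(t - self._epsilon, t + self._epsilon)

    def after(self, t: float) -> bool:
        """True if t has definitely passed."""
        return self.now().earliest > t

    def before(self, t: float) -> bool:
        """True if t has definitely not arrived."""
        return self.now().latest < t


tt = TrueTime(clock=lambda: 100.0, epsilon=0.5)  # fixed clock for a repeatable demo
interval = tt.now()
print(interval, tt.after(99.0), tt.before(101.0))
```

Note that for a timestamp inside the current interval both `after` and `before` return False: the caller cannot yet tell whether that time has passed.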
How Is TrueTime Implemented?
• A set of time master machines per datacenter
• The majority of masters have GPS receivers with dedicated antennas
• The remaining masters (referred to as Armageddon masters) are equipped with atomic clocks
• A timeslave daemon per machine
Time Reference Vulnerabilities
Two forms of time reference with two different failure modes (uncorrelated with each other):
• GPS:
– antenna and receiver failures
– local radio interference
– correlated failures (e.g. spoofing)
– GPS system outages
• Atomic clock:
– can drift significantly due to frequency error
How Does the Daemon Work?
The daemon polls a variety of masters and reaches a consensus about the correct timestamp:
• masters chosen from nearby datacenters
• masters from farther datacenters
• Armageddon masters
The daemon's poll interval is 30 seconds.
Between synchronizations, the daemon advertises a slowly increasing time uncertainty (ε).
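The growth of the uncertainty between polls can be sketched as a sawtooth. The 30-second poll interval comes from the slide; the drift rate of 200 microseconds per second and the 1 ms base uncertainty are figures reported in the Spanner paper and are used here only as illustrative constants. The function names are hypothetical.

```python
POLL_INTERVAL_S = 30.0        # from the slide: daemon polls every 30 seconds
DRIFT_RATE = 200e-6           # assumption: worst-case local drift, seconds per second
BASE_UNCERTAINTY_S = 0.001    # assumption: uncertainty right after a synchronization


def epsilon(seconds_since_sync: float) -> float:
    """Advertised uncertainty grows linearly with time since the last sync."""
    return BASE_UNCERTAINTY_S + DRIFT_RATE * seconds_since_sync


def epsilon_at(t: float) -> float:
    """Sawtooth: the uncertainty resets at every poll (assuming each poll succeeds)."""
    return epsilon(t % POLL_INTERVAL_S)


for t in (0.0, 15.0, 29.9, 30.0, 45.0):
    print(f"t={t:5.1f}s  epsilon={epsilon_at(t) * 1000:.2f} ms")
```

With these constants the advertised uncertainty rises from about 1 ms just after a poll to about 7 ms just before the next one.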
Transactions In Spanner
• Globally meaningful commit timestamps are assigned to distributed transactions
– If A happens-before B, then timestamp(A) < timestamp(B)
– A happens-before B if its effects become visible before B begins, in real time
• Visible means acknowledged to the client or updates applied to some replica
• Begins means the first request arrived at a Spanner server
• Two-phase commit
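One way to obtain the ordering property above is the "commit wait" rule described in the Spanner paper: pick the commit timestamp at the upper bound of the current TrueTime interval, then delay making the transaction visible until that timestamp has definitely passed. The sketch below simulates this with an integer microsecond clock; `SimClock`, the simplified `TrueTime` class, and the `commit` function are hypothetical helpers, and the fixed 4 ms uncertainty is an assumption.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class SimClock:
    """Simulated absolute time in microseconds (hypothetical test helper)."""
    now_us: int = 0

    def advance(self, us: int) -> None:
        self.now_us += us


class TrueTime:
    """Simplified TrueTime over the simulated clock with a fixed error bound."""

    def __init__(self, clock: SimClock, epsilon_us: int):
        self.clock = clock
        self.epsilon_us = epsilon_us

    def now(self) -> Tuple[int, int]:
        t = self.clock.now_us
        return (t - self.epsilon_us, t + self.epsilon_us)  # (earliest, latest)

    def after(self, t: int) -> bool:
        return self.now()[0] > t


def commit(tt: TrueTime, clock: SimClock, step_us: int = 1_000) -> int:
    """Assign a commit timestamp, then wait until it has definitely passed."""
    s = tt.now()[1]               # commit timestamp: latest possible current time
    while not tt.after(s):        # commit wait, roughly 2 * epsilon
        clock.advance(step_us)    # stands in for sleeping on a real system
    return s                      # only now is the commit made visible


clock = SimClock()
tt = TrueTime(clock, epsilon_us=4_000)
ts_a = commit(tt, clock)          # transaction A becomes visible after its wait
visible_a_at = clock.now_us
ts_b = commit(tt, clock)          # transaction B begins after A is visible
print(ts_a, visible_a_at, ts_b)
```

Because A is not visible until true time has passed its timestamp, any transaction B that begins afterwards is assigned a larger timestamp, which matches the happens-before rule in the list above.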
What About Performance?
Two-phase commit can raise availability and performance issues.
"We believe it is better to have application programmers deal with performance problems due to overuse of transactions as bottlenecks arise, rather than always coding around the lack of transactions."
Summary
• Externally consistent global write-transactions with synchronous replication.
• Schematized, semi-relational data model.
• SQL-like query interface.
• Auto-sharding, auto-rebalancing, automatic failure response.
• Exposes control of data replication and placement to user/application.