
PNUTS: Yahoo!’s Hosted Data Serving Platform

Yahoo! Research

presented by Liyan & Fang

Example: a social-network site. Brian asks "What are my friends up to?" and sees the latest status updates from his friends Sonja, Jimi, Brandon, and Kurt.

What does a web application need?

• Scalability
– Architectural scalability
– The ability to scale during periods of rapid growth with minimal operational effort
• Response time and geographic scope
– Fast response times for geographically distributed users
• High availability and fault tolerance
– The ability to read, and even write, data during failures
• Relaxed consistency guarantees
– Eventual consistency: update one replica first, then propagate the update to the others

What do we need from our DBMS?

• Web applications need:
– Scalability
• And the ability to scale linearly
– Geographic scope
– High availability
• Web applications typically have:
– Simplified query needs
• No joins or aggregations
– Relaxed consistency needs
• Applications can tolerate stale or reordered data

What is PNUTS?

CREATE TABLE Parts (
  ID VARCHAR,
  StockNumber INT,
  Status VARCHAR
  …
)

An example Parts table, replicated across regions:

A 42342 E
B 42521 W
C 66354 W
D 12352 E
E 75656 C
F 15677 E

PNUTS provides:
• A parallel database
• Geographic replication
• Indexes and views
• A structured, flexible schema
• Hosted, managed infrastructure

Query model

• Per-record operations
– Get
– Set
– Delete
• Multi-record operations
– Multiget
– Scan
– Getrange
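The query model can be sketched as a minimal in-memory table interface. The class and method names below are illustrative, not the actual PNUTS API (the real system exposes these operations over a REST API):

```python
# Illustrative sketch of the PNUTS query model over an in-memory,
# ordered table keyed by primary key.

class PnutsTable:
    def __init__(self):
        self._rows = {}              # primary key -> record

    # --- per-record operations ---
    def get(self, key):
        return self._rows.get(key)

    def set(self, key, record):
        self._rows[key] = record

    def delete(self, key):
        self._rows.pop(key, None)

    # --- multi-record operations ---
    def multiget(self, keys):
        return {k: self._rows[k] for k in keys if k in self._rows}

    def scan(self):
        for key in sorted(self._rows):
            yield key, self._rows[key]

    def getrange(self, lo, hi):
        # half-open range [lo, hi) over the primary-key order
        return [(k, self._rows[k]) for k in sorted(self._rows) if lo <= k < hi]
```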

Data-path components: clients, a REST API, routers, a tablet controller, storage units, and the message broker.

Detailed architecture

Data tables are horizontally partitioned into groups of records called tablets.

Storage units store tablets. They respond to get() and scan() requests by retrieving and returning matching records, and to set() requests by processing the update; to commit an update, the storage unit must first write it to the message broker.

Routers determine which storage unit is responsible for a given record being read or written by a client: first determine which tablet contains the record, then determine which storage unit holds that tablet.

The tablet controller determines when to move a tablet between storage units for load balancing or recovery, and when a large tablet must be split. It also updates the interval mapping, of which routers keep a cached copy.

The same data-path components (clients, REST API, routers, tablet controller, storage units) exist in each region; the Yahoo! Message Broker (YMB) connects the local region to remote regions.

Detailed architecture: replication

Record-level mastering: mastership is assigned on a record-by-record basis, so different records in the same table can be mastered in different clusters. (Over one week of traces, 85 percent of the writes to a given record originated in the same datacenter.)

A master publishes its updates to a single broker, so updates are delivered to replicas in commit order. YMB takes multiple steps to ensure messages are not lost before they are applied to the database, and messages published to one YMB cluster are relayed to other YMB clusters for delivery to local subscribers.
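The effect of per-record commit ordering can be sketched as a replica that applies an update only when its sequence number is newer than what the replica already has. This is a simplified model, not YMB's actual protocol:

```python
# Simplified model of a replica consuming a per-record update stream
# from the message broker. Because a record's master publishes to a
# single broker, updates for a key arrive in commit order; the replica
# applies them in sequence and skips anything it has already seen.

class Replica:
    def __init__(self):
        self.data = {}           # key -> (sequence number, value)

    def apply(self, key, seq, value):
        cur_seq = self.data.get(key, (0, None))[0]
        if seq <= cur_seq:
            return False         # duplicate or stale delivery; skip it
        self.data[key] = (seq, value)
        return True
```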

Query processing


Range queries

The router's interval mapping assigns each key range to a storage unit:

MIN-Canteloupe → SU1
Canteloupe-Lime → SU3
Lime-Strawberry → SU2
Strawberry-MAX → SU1

Storage unit 1 holds Apple, Avocado, Banana, Blueberry (and Strawberry, Tomato, Watermelon); storage unit 3 holds Canteloupe, Grape, Kiwi, Lemon; storage unit 2 holds Lime, Mango, Orange, Pear.

A range query such as "Grapefruit…Pear?" spans tablets, so the router splits it: "Grapefruit…Lime?" is sent to SU3, and "Lime…Pear?" is sent to SU2.
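A router lookup over an interval mapping like this amounts to a binary search on the sorted tablet boundaries. A minimal sketch, with the boundaries and owners below taken as illustrative values:

```python
import bisect

# Hypothetical interval mapping: sorted tablet boundaries and the
# storage unit owning each interval. Interval i covers keys in
# [boundaries[i], boundaries[i+1]); "" stands in for MIN.
boundaries = ["", "canteloupe", "lime", "strawberry"]
owners     = ["SU1", "SU3", "SU2", "SU1"]

def lookup(key):
    """Return the storage unit responsible for `key`."""
    # bisect_right finds the first boundary greater than the key;
    # the interval just before it is the one containing the key.
    i = bisect.bisect_right(boundaries, key) - 1
    return owners[i]
```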

Updates

The write path for a record with key k:
1. The client sends the write for key k to a router.
2. The router forwards it to the storage unit mastering the record.
3. The master storage unit publishes the write to the message broker.
4. Once the broker has logged the message, the update is committed.
5. SUCCESS is returned to the client.
6. The broker asynchronously delivers the write to storage units in the other regions.
7–8. Each replica applies the update using the sequence number assigned to key k.
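A toy model of this write path (client → router → master storage unit → message broker → remote replicas), with an in-process queue standing in for YMB:

```python
from collections import deque

# Toy end-to-end write path. The broker is a simple in-process queue;
# real PNUTS uses the Yahoo! Message Broker.

broker = deque()                 # committed-but-undelivered updates
master = {}                      # master region's storage unit
replica = {}                     # a remote region's storage unit
seq = {}                         # per-key sequence numbers

def write(key, value):
    # client -> router -> master storage unit
    seq[key] = seq.get(key, 0) + 1
    master[key] = (seq[key], value)
    # publish to the broker; once logged there, the write is committed
    broker.append((key, seq[key], value))
    # SUCCESS returned to the client before remote delivery happens
    return "SUCCESS"

def deliver_one():
    # the broker asynchronously delivers one update to the replica
    key, s, value = broker.popleft()
    replica[key] = (s, value)
```

Note that the client gets SUCCESS as soon as the broker has the message; replicas catch up asynchronously.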

Asynchronous replication and consistency

Asynchronous replication

Consistency model

• Goal: make it easier for applications to reason about updates and cope with asynchrony
• What happens to a record with primary key "Brian"?

Over time the record moves through a sequence of versions: an insert creates v.1, successive updates produce v.2 through v.8, and a delete eventually ends the record's life. All versions between an insert and the following delete form one generation (here, Generation 1).

Consistency model

Read-any: returns a possibly stale version of the record (any stale version, or the current one).

For example, in a social-networking application, displaying a friend's status does not absolutely require the most up-to-date value, so read-any can be used.

Consistency model

Read up-to-date: returns the current version of the record, reflecting all writes that have committed.

Consistency model

Read-critical(required version): returns a version of the record that is strictly newer than, or the same as, the required version. For example, a request "read ≥ v.6" may return v.6, v.7, or v.8, but never v.5.

This is useful when a user writes a record and then wants to read a version that definitely reflects his changes.
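Read-critical can be sketched as a version check against a local replica; the fallback to the master copy here is a simplifying assumption, not the exact PNUTS mechanism:

```python
# Simplified model of read-critical. Each store maps a key to
# (version, value). If the local replica's copy is older than the
# required version, the call must fetch a newer copy (modeled here
# by reading the master's copy).

def read_critical(replica, master, key, required_version):
    version, value = replica.get(key, (0, None))
    if version >= required_version:
        return version, value        # local copy is new enough
    return master[key]               # fall back to an up-to-date copy
```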

Consistency model

Write: installs a new version of the record, which becomes the current version.

Consistency model

Test-and-set-write(required version): performs the requested write if and only if the present version of the record is the same as the required version; otherwise the call fails with an error (e.g., "write if = v.7" fails when the current version has already advanced past v.7).

This call can be used to implement transactions that first read a record and then write it based on the read, e.g., incrementing the value of a counter.

Record and Tablet Mastership

• Data in PNUTS is replicated across sites
• A hidden field in each record stores which copy is the master copy
– Updates can be submitted to any copy
– They are forwarded to the master and applied in the order received by the master
• The record also contains the origin of the last few updates
– The current master can change mastership based on this information
– A mastership change is simply a record update
• Tablet mastership
– Required to ensure primary-key consistency
– Can differ from record mastership
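The mastership-change heuristic can be sketched as follows; the window size and the "all recent writes from one other region" threshold are illustrative assumptions, not the tuned values used in production:

```python
from collections import deque, Counter

# Sketch of record-level mastership handoff: each record remembers the
# origin region of its last few updates; if all recent writes come from
# another region, the current master hands mastership over. WINDOW and
# the handoff threshold are hypothetical.

WINDOW = 3

class Record:
    def __init__(self, master_region):
        self.master = master_region
        self.origins = deque(maxlen=WINDOW)   # regions of recent writes

    def apply_write(self, origin_region):
        self.origins.append(origin_region)
        region, count = Counter(self.origins).most_common(1)[0]
        if region != self.master and count == WINDOW:
            # a mastership change is just another record update
            self.master = region
```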

Other Features

• Per-record transactions
• Copying a tablet (e.g., for failure recovery)
– Request a copy
– Publish a checkpoint message
– Get a copy of the tablet as of when the checkpoint is received
– Apply later updates
• Tablet split
– Must be coordinated across all copies

Query Processing

• A range scan can span tablets
– Only one tablet is scanned at a time
– The client may not need all results at once
• A continuation object is returned to the client to indicate where the range scan should continue
• Notification
– One pub-sub topic per tablet
– Clients know about tables, not tablets
– Clients are automatically subscribed to all tablets, even as tablets are added or removed
– The usual pub-sub problem of undelivered notifications is handled in the usual way
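The tablet-at-a-time scan with a continuation object can be sketched as below; the tablet boundaries, data, and the choice of "resume key" as the continuation are illustrative:

```python
# Sketch of a range scan that processes one tablet per call and hands
# back a continuation object: the key at which the next call resumes.
# "~" stands in for MAX in the last tablet's boundary.

tablets = [("A", "H"), ("H", "P"), ("P", "~")]        # [lo, hi) ranges
rows = {"Apple": 1, "Grape": 2, "Kiwi": 3, "Pear": 4}

def scan_range(lo, hi, continuation=None):
    """Scan [lo, hi); returns (results, continuation or None if done)."""
    start = continuation or lo
    for tlo, thi in tablets:
        if tlo <= start < thi:                        # tablet holding `start`
            results = [(k, v) for k, v in sorted(rows.items())
                       if start <= k < min(hi, thi)]
            nxt = thi if thi < hi else None           # None: scan complete
            return results, nxt
    return [], None
```

Each call touches a single tablet, so a client can stop early and resume later by passing the continuation back in.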

Experiments

Experimental setup

• Production version supporting both hash tables and ordered tables
• Database
– 3 regions: 2 on the west coast, 1 on the east coast
– 1 KB records, 128 tablets per region
– Each process ran 100 client threads, for 300 clients in total across the system
• Workload
– 1200-3600 requests/second
– 0-50% writes
– 80% locality

Inserts

• Inserts (hash tables)
– 75.6 ms per insert in West 1 (the tablet master)
– 131.5 ms per insert in the non-master West 2
– 315.5 ms per insert in the non-master East
• Inserts (ordered tables)
– 33 ms per insert in West 1
– 105.8 ms per insert in the non-master West 2
– 324.5 ms per insert in the non-master East

With 10% writes (the default), latency first decreases and then increases with increasing load. The high latency at low request rates resulted from an anomaly in the HTTP client library we used, which closed TCP connections between requests at low request rates, requiring an expensive TCP setup for each call. As the proportion of reads increases, the average latency decreases.

Scalability

Figure: average latency (ms) versus number of storage units (1 to 6), for hash tables and ordered tables.

Size of range scans

Figure: average latency (ms) versus fraction of the table scanned (0 to 0.12), for 30 clients and 300 clients.


Thanks!