Everest: scaling down peak loads through I/O off-loading
D. Narayanan, A. Donnelly, E. Thereska, S. Elnikety, A. Rowstron
Microsoft Research Cambridge, UK


TRANSCRIPT

Page 1: Everest: scaling down peak loads through I/O  off-loading

Everest: scaling down peak loads through I/O off-loading
D. Narayanan, A. Donnelly, E. Thereska, S. Elnikety, A. Rowstron
Microsoft Research Cambridge, UK

hanzler666 @ UKClimbing.com

Page 2: Everest: scaling down peak loads through I/O  off-loading

Problem: I/O peaks on servers

• Short, unexpected peaks in I/O load
  – This is not about predictable trends
• Uncorrelated across servers in the data center
  – And across volumes on a single server
• Bad I/O response times during peaks

Page 3: Everest: scaling down peak loads through I/O  off-loading

Example: Exchange server
• Production mail server
  – 5000 users, 7.2 TB across 8 volumes
• Well provisioned
  – Hardware RAID, NVRAM, over 100 spindles
• 24-hour block-level I/O trace
  – At peak load, response time is 20x the mean
  – Peaks are uncorrelated across volumes

Page 4: Everest: scaling down peak loads through I/O  off-loading

Exchange server load

[Chart: per-volume load (reqs/s, log scale 100–100,000) vs. time of day over the 24-hour trace]

Page 5: Everest: scaling down peak loads through I/O  off-loading

Everest client: write off-loading

[Diagram: without off-loading, reads and writes go to the base volume; when off-loading, writes go to Everest stores and reads of off-loaded data are served from them; reclaiming later moves the data back to the volume.]
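To make the data flow on this slide concrete, here is a minimal Python sketch of the off-loading path: during a peak, writes are redirected to the least-loaded store and recorded in an in-memory map so that reads of off-loaded blocks go to the right place. This is an illustration only, not the Everest driver (which works at the Windows block level); the class names, the is_peak policy, and the store interface are invented for the example.

# Illustrative sketch of write off-loading; not the actual Everest implementation.

class StoreSketch:
    """Stand-in for a remote Everest store holding versioned block records."""
    def __init__(self):
        self.records = {}                      # block -> (version, data)

    def write(self, block, version, data):
        self.records[block] = (version, data)

    def read(self, block):
        return self.records[block][1]

    def delete(self, block, version):
        if block in self.records and self.records[block][0] == version:
            del self.records[block]


class EverestClientSketch:
    """Per-volume client: off-loads writes during peaks, routes reads."""
    def __init__(self, volume, stores):
        self.volume = volume                   # dict block -> data (stand-in for the base volume)
        self.stores = stores                   # available Everest stores
        self.offloaded = {}                    # block -> (store, version): latest off-loaded version
        self.version = 0

    def is_peak(self):
        # Placeholder policy; the real client monitors load on its volume.
        return bool(self.stores)

    def write(self, block, data):
        if self.is_peak():
            self.version += 1
            store = min(self.stores, key=lambda s: len(s.records))   # pick a lightly-loaded store
            store.write(block, self.version, data)
            self.offloaded[block] = (store, self.version)            # in-memory soft state
        else:
            self.volume[block] = data

    def read(self, block):
        # Reads must always return the latest version, wherever it lives.
        if block in self.offloaded:
            store, _ = self.offloaded[block]
            return store.read(block)
        return self.volume.get(block)

Reclaiming (slide 12) would later copy each entry of offloaded back to the base volume and delete it from its store.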

Page 6: Everest: scaling down peak loads through I/O  off-loading

Exploits workload properties
• Peaks uncorrelated across volumes
  – A loaded volume can find less-loaded stores
• Peaks have some writes
  – Off-load the writes so that reads see less contention
• Few foreground reads on off-loaded data
  – Recently written, hence in the buffer cache
  – Can optimize stores for writes

Page 7: Everest: scaling down peak loads through I/O  off-loading

Challenges
• Any write anywhere
  – Maximize potential for load balancing
• Reads must always return the latest version
  – Split across stores/base volume if required
• State must be consistent, recoverable
  – Track both current and stale versions
• No meta-data writes to the base volume
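The "split across stores/base volume if required" point can be illustrated with a small helper that partitions a requested block range into sub-reads by location; this is a hypothetical sketch, with the shape of the map assumed from the soft state described on the next slides.

def split_read(first_block, count, offload_map):
    """Partition the range [first_block, first_block + count) into sub-reads.

    offload_map maps a block number to the store holding its latest version;
    blocks not in the map are read from the base volume.  Returns
    (location, start_block, length) tuples in block order.
    """
    plan = []
    start = first_block
    current = offload_map.get(first_block, "base-volume")
    for b in range(first_block + 1, first_block + count):
        location = offload_map.get(b, "base-volume")
        if location != current:                      # location changed: close the current run
            plan.append((current, start, b - start))
            start, current = b, location
    plan.append((current, start, first_block + count - start))
    return plan

# Example: blocks 10-11 were off-loaded to "store-A", blocks 12-14 were not.
# split_read(10, 5, {10: "store-A", 11: "store-A"})
#   -> [("store-A", 10, 2), ("base-volume", 12, 3)]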

Page 8: Everest: scaling down peak loads through I/O  off-loading

Design features

• Recoverable soft state
• Write-optimized stores
• Reclaiming off-loaded data
• N-way off-loading
• Load-balancing policies

Page 9: Everest: scaling down peak loads through I/O  off-loading

Recoverable soft state
• Need meta-data to track off-loads
  – block ID → <location, version>
  – Latest version as well as old (stale) versions
• Meta-data cached in memory
  – On both clients and stores
• Off-loaded writes have a meta-data header
  – 64-bit version, client ID, block range
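As a concrete illustration of this meta-data, the sketch below packs a header (64-bit version, client ID, block range) in front of each off-loaded record and keeps the client-side map of latest and stale versions. The exact field widths and the SoftState class are assumptions for the example, not Everest's on-disk format.

import struct

# Hypothetical header layout: 64-bit version, 32-bit client ID,
# 64-bit first block, 32-bit block count (big-endian, no padding).
HEADER_FMT = ">QIQI"
HEADER_SIZE = struct.calcsize(HEADER_FMT)

def pack_record(version, client_id, first_block, block_count, data):
    """Prepend the meta-data header to the off-loaded data."""
    return struct.pack(HEADER_FMT, version, client_id, first_block, block_count) + data

def unpack_record(record):
    version, client_id, first_block, block_count = struct.unpack_from(HEADER_FMT, record)
    return version, client_id, first_block, block_count, record[HEADER_SIZE:]

class SoftState:
    """In-memory map: latest off-loaded version per block, plus stale versions."""
    def __init__(self):
        self.latest = {}      # block -> (location, version)
        self.stale = {}       # block -> [(location, version), ...] awaiting deletion

    def record_write(self, block, location, version):
        if block in self.latest:
            self.stale.setdefault(block, []).append(self.latest[block])
        self.latest[block] = (location, version)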

Page 10: Everest: scaling down peak loads through I/O  off-loading

Recoverable soft state (2)
• Meta-data also persisted on stores
  – No synchronous writes to the base volume
  – Stores write data + meta-data as one record
• “Store set” persisted on the base volume
  – Small, infrequently changing
• Client recovery → contact the store set
• Store recovery → read from disk
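A hedged sketch of the two recovery paths: the client rebuilds its soft state by querying every store in its persisted store set, and a store rebuilds its index by scanning record headers in its on-disk log. The enumerate() call and the shape of log_records are assumptions made for this example, not the real protocol.

# Hypothetical recovery sketches; the real protocol lives inside the Everest driver.

def recover_client(store_set):
    """Rebuild the client's block -> (location, version) map from its store set.

    Each store is assumed to answer enumerate() with (block, version) pairs
    for the records it still holds."""
    latest = {}
    for store in store_set:
        for block, version in store.enumerate():
            if block not in latest or version > latest[block][1]:
                latest[block] = (store, version)
    return latest

def recover_store(log_records):
    """Rebuild a store's in-memory index by scanning its log.

    log_records yields (offset, header) pairs, where the header carries the
    version and block range written alongside the data (see the record
    format sketched earlier)."""
    index = {}
    for offset, (version, client_id, first_block, count) in log_records:
        for block in range(first_block, first_block + count):
            if block not in index or version > index[block][0]:
                index[block] = (version, offset)
    return index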

Page 11: Everest: scaling down peak loads through I/O  off-loading

Everest stores
• Short-term, write-optimized storage
  – Simple circular log
  – Small file or partition on an existing volume
  – Not LFS: data is reclaimed, no cleaner
• Monitors load on the underlying volume
  – Only used by clients when lightly loaded
• One store can support many clients
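One way to express "only used by clients when lightly loaded" is a store-side load monitor over the underlying volume; the smoothing constant and threshold below are invented for this sketch.

import time

class LoadMonitor:
    """Hypothetical monitor of the load on a store's underlying volume."""
    def __init__(self, alpha=0.2, idle_threshold_reqs_per_s=50.0):
        self.alpha = alpha                        # smoothing factor for the request rate
        self.threshold = idle_threshold_reqs_per_s
        self.rate = 0.0                           # exponentially-weighted reqs/s
        self.count = 0
        self.last = time.monotonic()

    def record_request(self):
        self.count += 1                           # called for every I/O on the volume

    def tick(self):
        """Call periodically (e.g. once a second) to update the smoothed rate."""
        now = time.monotonic()
        elapsed = max(now - self.last, 1e-6)
        self.rate = self.alpha * (self.count / elapsed) + (1 - self.alpha) * self.rate
        self.count, self.last = 0, now

    def lightly_loaded(self):
        # Clients only off-load to this store while the volume is below the threshold.
        return self.rate < self.threshold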

Page 12: Everest: scaling down peak loads through I/O  off-loading

Everest client: reclaiming in the background

[Diagram: the client issues “read any” to the stores; a store returns <block range, version, data>; the client writes the data back to the volume and then sends delete(block range, version) to the store.]

• Multiple concurrent reclaim “threads”
  – Efficient utilization of disk/network resources
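The reclaim sequence on this slide (read any off-loaded record, write it back to the volume, then delete that version from the store) with several concurrent workers might look like the following sketch; the queue and locking details are assumptions, and the store interface matches the earlier sketches.

import queue
import threading

def reclaim_worker(work, volume, volume_lock):
    """Drain off-loaded records back to the base volume."""
    while True:
        item = work.get()
        if item is None:                     # poison pill: shut this worker down
            work.task_done()
            return
        store, block, version = item
        data = store.read(block)             # "read any": fetch the off-loaded data
        with volume_lock:
            volume[block] = data             # write it back to the base volume
        store.delete(block, version)         # only now is it safe to delete
        work.task_done()

def reclaim_all(items, volume, n_workers=4):
    """items: iterable of (store, block, version) tuples to reclaim."""
    work, lock = queue.Queue(), threading.Lock()
    workers = [threading.Thread(target=reclaim_worker, args=(work, volume, lock))
               for _ in range(n_workers)]
    for w in workers:
        w.start()
    for item in items:
        work.put(item)
    work.join()                              # wait for all reclaims to finish
    for _ in workers:
        work.put(None)
    for w in workers:
        w.join()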

Page 13: Everest: scaling down peak loads through I/O  off-loading

Correctness invariants
• I/O on an off-loaded range is always off-loaded
  – Reads: sent to the correct location
  – Writes: ensure the latest version is recoverable
  – Foreground I/Os never blocked by reclaim
• Deletion of a version only allowed if
  – A newer version has been written to some store, or
  – The data has been reclaimed and older versions deleted
• All off-loaded data is eventually reclaimed
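The deletion rule in the middle bullet can be restated as a small predicate; this is a paraphrase of the invariant for illustration, not code from the system.

def can_delete(block, version, newest_durable_version, reclaimed_blocks):
    """May a store delete its copy of (block, version)?

    newest_durable_version: block -> newest version known to be durable on some store.
    reclaimed_blocks: blocks whose data has been written back to the base volume
    and whose older versions have already been deleted.
    """
    newer_exists = newest_durable_version.get(block, 0) > version
    fully_reclaimed = block in reclaimed_blocks
    return newer_exists or fully_reclaimed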

Page 14: Everest: scaling down peak loads through I/O  off-loading

Evaluation

• Exchange server traces
• OLTP benchmark
• Scaling
• Micro-benchmarks
• Effect of NVRAM
• Sensitivity to parameters
• N-way off-loading

Page 15: Everest: scaling down peak loads through I/O  off-loading

Exchange server workload
• Replay Exchange server trace
  – 5000 users, 8 volumes, 7.2 TB, 24 hours
• Choose time segments with peaks
  – Extend segments to cover all reclaim
• Our server: 14 disks, 2 TB
  – Can fit 3 Exchange volumes
• Subset of volumes for each segment

Page 16: Everest: scaling down peak loads through I/O  off-loading

Trace segment selection

[Chart: total I/O rate (reqs/s, log scale) vs. time of day over the 24-hour trace]

Page 17: Everest: scaling down peak loads through I/O  off-loading

Trace segment selection

[Chart: the same total I/O rate plot, with Peak 1, Peak 2, and Peak 3 highlighted]

Page 18: Everest: scaling down peak loads through I/O  off-loading

Three volumes/segment

[Diagram: for each trace segment, the min-, median-, and max-load volume traces are replayed on three volumes, each with an Everest client and a store (3%).]

Page 19: Everest: scaling down peak loads through I/O  off-loading

Mean response time

[Chart: mean response time (ms, 0–200) for reads and writes in Peaks 1–3, with and without off-load]

Page 20: Everest: scaling down peak loads through I/O  off-loading

99th percentile response time

[Chart: 99th-percentile response time (ms, 0–2000) for reads and writes in Peaks 1–3, with and without off-load]

Page 21: Everest: scaling down peak loads through I/O  off-loading

Exchange server summary
• Substantial improvement in I/O latency
  – On a real enterprise server workload
  – Both reads and writes, mean and 99th percentile
• What about application performance?
  – An I/O trace cannot show end-to-end effects
• Where is the benefit coming from?
  – Extra resources, log structure, ...?

Page 22: Everest: scaling down peak loads through I/O  off-loading

OLTP benchmark

[Diagram: an OLTP client drives SQL Server over the LAN; the Everest client is interposed on the SQL Server binary via Detours DLL redirection, off-loading from the log/data volumes to an Everest store.]

• 10 min warmup
• 10 min measurement

Page 23: Everest: scaling down peak loads through I/O  off-loading

OLTP throughput

[Chart: throughput (tpm, 0–3000) for No off-load, Off-load, Log-structured, 2-disk striped, and Striped + log-structured configurations. Annotations: off-load benefit = extra disk + log layout; 2x disks, 3x speedup?]

Page 24: Everest: scaling down peak loads through I/O  off-loading

Off-loading not a panacea
• Works for short-term peaks
• Cannot be used to improve performance 24/7
• Data is usually reclaimed while the store is still idle
  – Long-term off-load → eventual contention
• Data is reclaimed before the store fills up
  – Long-term → log cleaner issue

Page 25: Everest: scaling down peak loads through I/O  off-loading

Conclusion
• Peak I/O is a problem
• Everest solves this through off-loading
• By modifying the workload at the block level
  – Removes writes from the overloaded volume
  – Off-loading is short term: data is reclaimed
• Consistency and persistence are maintained
  – State is always correctly recoverable

Page 26: Everest: scaling down peak loads through I/O  off-loading

Questions?


Page 27: Everest: scaling down peak loads through I/O  off-loading

Why not always off-load?

[Diagram: two SQL Servers, each with its own data volume, store, and OLTP client; SQL Server 1 runs an Everest client and off-loads writes onto the volume backing SQL Server 2, whose own reads and writes then contend with the off-loaded I/O.]

Page 28: Everest: scaling down peak loads through I/O  off-loading

10 min off-load, 10 min contention

[Chart: speedup (0–4x) during off-load vs. during the contention period, for server 1 and server 2]

Page 29: Everest: scaling down peak loads through I/O  off-loading

Mean and 99th pc (log scale)

[Chart: response time (ms, log scale 1–10,000) for reads and writes in Peaks 1–3, with and without off-load]

Page 30: Everest: scaling down peak loads through I/O  off-loading

Read/write ratio of peaks

[Chart: CDF of the write percentage of peaks (cumulative fraction vs. % of writes, 0–100%)]

Page 31: Everest: scaling down peak loads through I/O  off-loading

Exchange server response time

[Chart: response time (log scale) vs. time of day over the 24-hour trace]

Page 32: Everest: scaling down peak loads through I/O  off-loading

Exchange server load (volumes)

[Chart: max, mean, and min per-volume load (reqs/s, log scale) vs. time of day]

Page 33: Everest: scaling down peak loads through I/O  off-loading

Effect of volume selection

[Chart: Peak 1 load (reqs/s/volume) over time, for all volumes vs. the selected subset]

Page 34: Everest: scaling down peak loads through I/O  off-loading

Effect of volume selection

[Chart: Peak 2 load (reqs/s/volume) over time, for all volumes vs. the selected subset]

Page 35: Everest: scaling down peak loads through I/O  off-loading

Effect of volume selection

[Chart: Peak 3 load (reqs/s/volume) over time, for all volumes vs. the selected subset]

Page 36: Everest: scaling down peak loads through I/O  off-loading

Scaling with #stores

[Diagram: the OLTP benchmark setup as before (OLTP client, SQL Server binary with the Everest client interposed via Detours DLL redirection, log and data volumes), off-loading to one, two, or three stores over the LAN.]

Page 37: Everest: scaling down peak loads through I/O  off-loading

Scaling: linear until CPU-bound

[Chart: speedup vs. number of stores (0–3); speedup scales with the number of stores until the workload becomes CPU-bound]

Page 38: Everest: scaling down peak loads through I/O  off-loading

Everest store: circular log layout

[Diagram: a header block followed by a circular log; new records are appended at the head, the active log lies between tail and head, and stale or reclaimed records are deleted so the tail can advance.]
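A toy model of the circular-log bookkeeping shown on this slide: records are appended at the head, space is reused only from the tail, and the tail advances over records that have become stale (deleted by the client after reclaim, or after a newer version was written elsewhere). The bookkeeping style here is an assumption for illustration, not the store's actual layout.

from collections import deque

class CircularLogSketch:
    """Toy model of the store's circular log (capacity measured in blocks)."""
    def __init__(self, capacity):
        self.capacity = capacity
        self.free = capacity
        self.records = deque()        # [record_id, length, stale?] in log order (tail..head)
        self.next_id = 0

    def append(self, length):
        """Append a record at the head of the log."""
        if length > self.free:
            raise RuntimeError("log full: waiting for deletes/reclaim to free space")
        rid = self.next_id
        self.next_id += 1
        self.records.append([rid, length, False])
        self.free -= length
        return rid

    def mark_stale(self, rid):
        """Called when the client deletes this record's version."""
        for rec in self.records:
            if rec[0] == rid:
                rec[2] = True
                break
        self._advance_tail()

    def _advance_tail(self):
        # Space is reused only from the tail: pop stale records until a live one is hit.
        while self.records and self.records[0][2]:
            _, length, _ = self.records.popleft()
            self.free += length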

Page 39: Everest: scaling down peak loads through I/O  off-loading

Exchange server load: CDF

[Chart: CDF of request rate per volume (reqs/s, log scale)]

Page 40: Everest: scaling down peak loads through I/O  off-loading

Unbalanced across volumes

[Chart: CDFs of the min, mean, and max request rate per volume (reqs/s, log scale)]