Storage Fabric · CS6453 · 2017. 5. 7.
TRANSCRIPT
Summary
Last week: NVRAM is going to change the way we think about storage.
Today: Challenges of storage layers (SSDs, HDs) built to handle massive data:
Slowdowns in HDs and SSDs.
Enforcing policies for IO operations in Cloud architectures.
Background: Storage for Big Data
One disk is not enough to handle massive amounts of data.
Last time: Efficient datacenter networks using large number of cheap
commodity switches.
Solution: Efficient IO performance using large number of commodity storage
devices.
Background: RAID 0
Achieves Nx performance where
N is the number of Disks.
Is this for free?
When N becomes large, the probability of a Disk failure becomes large as well.
RAID 0 does not tolerate failures.
Background: RAID 1
Achieves (K-1)-fault tolerance
with Kx Disks.
Is this for free?
There are Kx more disks (e.g. to tolerate 1 failure you need 2x more Disks than RAID 0).
RAID 1 does not utilize resources in an efficient way.
Background: Erasure Code
Achieves K-fault tolerance with
N+K Disks.
Efficient utilization of Disks (not
as great as RAID 0).
Fault-Tolerance (not as great as
RAID 1).
Is this for free?
Reconstruction Cost: the # of Disks that must be read to recover data in case of failure(s).
RAID 6 has a Reconstruction Cost of 3.
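The reconstruction-cost idea can be made concrete with a single-parity scheme (a RAID-4/5 style sketch, not code from the lecture): one XOR parity block over N data blocks tolerates a single failure, and rebuilding a lost block requires reading all N surviving blocks.

```python
from functools import reduce

def parity(blocks):
    # XOR corresponding bytes of all blocks to form the parity block.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def reconstruct(survivors):
    # XOR of the surviving blocks (data + parity) yields the lost block;
    # reading every other disk is exactly the "Reconstruction Cost".
    return parity(survivors)

data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]  # 3 data disks (toy blocks)
p = parity(data)                                # 1 parity disk
rebuilt = reconstruct([data[0], data[2], p])    # suppose disk 1 failed
assert rebuilt == data[1]
```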
Modern Erasure Code Techniques
Erasure Coding in Windows Azure Storage [Huang, 2012]
Exploit Point:
Prob[1 failure] ≫ Prob[2 failures or more]
Solution: Construct Erasure Code Technique that has low reconstruction cost for 1
failure.
1.33x more storage overhead (relatively low).
Tolerate up to 3 failures in 16 storage devices.
Reconstruction cost of 6 for 1 failure and 12 for 2+ failures.
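A rough sketch of the local-group idea behind these numbers (hypothetical layout: the real Azure LRC(12,2,2) also keeps two Reed-Solomon global parities, which are what raise tolerance to 3 failures and are omitted here): 12 data blocks split into two local groups of 6, each with its own XOR local parity, so a single failed block is rebuilt from only 6 reads instead of 12.

```python
from functools import reduce

def xor_blocks(blocks):
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# 12 data blocks in two local groups of 6, one XOR local parity per group.
data = [bytes([i, (i * 7) % 256]) for i in range(12)]
groups = [data[:6], data[6:]]
local_parity = [xor_blocks(g) for g in groups]

# Single failure: rebuild block 3 from its local group alone.
lost = 3
survivors = [b for i, b in enumerate(groups[0]) if i != lost] + [local_parity[0]]
assert len(survivors) == 6                  # reconstruction cost 6, not 12
assert xor_blocks(survivors) == data[lost]
```

With the two global parities counted in, storage is (12 + 2 + 2) / 12 ≈ 1.33x, matching the overhead quoted above.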
The Tail at Store: Problem
We have seen how we treat failures with reconstruction. What about
slowdowns in HDs (or SSDs)?
A slowdown of a disk (with no failures) might have a significant impact on overall performance.
Questions:
Do HDs or SSDs exhibit transient slowdowns?
Are slowdowns of disks frequent enough to affect the overall performance?
What causes slowdowns?
How do we deal with slowdowns?
The Tail at Store: Study
Study dataset:

                        Disk         SSD
#RAID groups            38,029       572
#Data drives per group  3-26         3-22
#Data drives            458,482      4,069
Total drive hours       857,183,442  7,481,055
Total RAID hours        72,046,373   1,072,690

RAID groups consist of data drives (D) plus parity drives (P, Q).

[Figure: CDF of per-drive-hour slowdown (Disk), 1x to 8x.]
The Tail at Store: Slowdowns?
Hourly average I/O latency per drive: L
Slowdown: S = L / L_median, where L_median is the median over the drives of the same RAID group
Tail: T = max_i S_i
Slow Disks: S ≥ 2
Disk: S ≥ 2 at the 99.8th percentile; S ≥ 1.5 at the 99.3rd percentile
SSD: S ≥ 2 at the 97.8th percentile; S ≥ 1.5 at the 95.2nd percentile
SSDs exhibit even more slowdowns
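The slowdown metric is easy to compute from per-drive hourly latencies; a minimal sketch with invented numbers:

```python
import statistics

def slowdowns(hourly_latency):
    # S_i = L_i / L_median, with the median taken across the drives
    # of the same RAID group in the same hour.
    med = statistics.median(hourly_latency)
    return [l / med for l in hourly_latency]

lat = [4.0, 4.2, 3.8, 4.1, 12.0]                # ms; one drive is slow this hour
s = slowdowns(lat)
tail = max(s)                                   # T = max_i S_i (about 2.93 here)
slow = [i for i, v in enumerate(s) if v >= 2]   # drives counted as "slow"
```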
[Figure: CDF of slowdown interval length, 1 to 256 hours (Disk vs. SSD).]
The Tail at Store: Duration?
Slowdowns are transient.
40% of HD slowdowns last ≥ 2 hours.
12% of HD slowdowns last ≥ 10 hours.
Many slowdowns happen in consecutive hours (and thus last longer).
[Figure: CDF of inter-arrival period between slowdowns, 0 to 35 hours (Disk vs. SSD).]
The Tail at Store: Correlation between slowdowns in the same storage?
90% of Disk slowdowns are within 24 hours of another slowdown of the same Disk.
> 80% of SSD slowdowns are within 24 hours of another slowdown of the same SSD.
Slowdowns of the same Disk cluster relatively close to each other in time.
[Figure: CDF of rate imbalance (RI) within hours where S_i ≥ 2 (Disk vs. SSD).]
The Tail at Store: Causes?
RI = (I/O rate) / (I/O rate)_median
Rate imbalance does not seem to be the main cause of slowdowns for slow Disks.
[Figure: CDF of size imbalance (ZI) within hours where S_i ≥ 2 (Disk vs. SSD).]
The Tail at Store: Causes?
ZI = (I/O size) / (I/O size)_median
Size imbalance does not seem to be the main cause of slowdowns for slow Disks.
[Figure: CDF of slowdown vs. drive age (Disk), 1x to 5x, one curve per year of age.]
The Tail at Store: Causes?
Disk age shows some correlation with slowdowns, but the correlation is not strong.
The Tail at Store: Causes?
No correlation of slowdowns to the time of day (0:00 to 24:00)
No explicit drive events around slow hours
Unplugging disks and plugging them back does not particularly help
SSD vendors have significant differences between them
The Tail at Store: Solutions
Create Tail-Tolerant RAIDS.
Treat slow disks as failed disks.
Reactive
Detect slow Disks: reads that take much longer than expected (> 2x other Disks).
If a Disk is slow, reconstruct the answer from the other disks using RAID redundancy.
In the best case latency is around 3x that of a read from an average Disk (timeout plus reconstruction read).
Proactive
Always use RAID redundancy for additional read.
Take fastest answer.
Uses much more I/O bandwidth.
Adaptive
Combination of both approaches taking into account the findings.
Use reactive approach until a slowdown is detected.
After this use proactive approach since slowdowns are repetitive and last many hours.
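The three strategies can be sketched as a single read path (all callables and the timeout are hypothetical placeholders, not lecture code): proactive hedging once a drive is marked slow, otherwise a reactive timeout before paying for the redundant read.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def adaptive_read(read_block, reconstruct_from_peers, marked_slow, timeout_s):
    with ThreadPoolExecutor(max_workers=2) as pool:
        primary = pool.submit(read_block)
        if marked_slow:
            # Proactive: issue the redundant RAID read immediately,
            # return whichever answer arrives first.
            backup = pool.submit(reconstruct_from_peers)
            done, _ = wait([primary, backup], return_when=FIRST_COMPLETED)
            return next(iter(done)).result()
        # Reactive: give the drive timeout_s (e.g. ~2x average latency),
        # then fall back to reconstruction, ~3x latency in the worst case.
        done, _ = wait([primary], timeout=timeout_s)
        return primary.result() if done else reconstruct_from_peers()
```

The proactive branch burns extra I/O bandwidth on every request, which is why the adaptive scheme only enables it after a slowdown has been observed.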
The Tail at Store: Conclusions
More research on possible causes for Disk and SSD slowdowns is required
Need Tail-Tolerant RAIDS to reduce the overhead from slowdowns
Since reconstruction of data is the way to deal with slowdowns and if
Prob[1 slowdown] ≫ Prob[2 slowdowns or more]
the Azure paper [Huang, 2012] becomes more relevant.
Background: Cloud Storage
General Purpose Applications
Separate VM-VM connections from VM-Storage connections
Storage is virtualized
Many layers from application to actual storage
Resources are shared across multiple tenants
IOFlow: Problem
Cannot support end-to-end policies (e.g.
minimum IO bandwidth from application to
storage)
Applications do not have any way of
expressing their storage policies
Shared infrastructure, where aggressive applications tend to get more IO bandwidth
IOFlow: Challenges
No existing enforcing mechanism for
controlling IO rates
Aggregate performance policies
Non-performance policies
Admission control
Dynamic enforcement
Support for unmodified applications and VMs
IOFlow: Do it like SDNs
IOFlow: Supported policies
<VM, Destination> -> Bandwidth (static, compute side)
<VM, Destination> -> Min Bandwidth (dynamic, compute side)
<VM, Destination> -> Sanitize (static, compute or storage side)
<VM, Destination> -> Priority Level (static, compute and storage side)
<Set of VMs, Set of Destinations> -> Bandwidth (dynamic, compute side)
Example 1: Interface
Policies:
<VM1,Server X> -> B1
<VM2,Server X> -> B2
Controller to SMBc of physical server containing VM1 and VM2
createQueueRule(<VM1,Server X>,Q1)
createQueueRule(<VM2,Server X>,Q2)
createQueueRule(<*,*>,Q0)
configureQueueService(Q1, <B1, low, S>), where S is the size of the queue
configureQueueService(Q2, <B2, low, S>)
configureQueueService(Q0, <C-B1-B2, low, S>), where C is the Capacity of Server X.
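The bandwidth part of configureQueueService can be approximated with a token bucket; a sketch with made-up numbers for C, B1, and B2 (IOFlow's actual enforcement inside the IO stack is more involved):

```python
class TokenBucket:
    """Queue rate limit: an IO of `size` bytes may proceed only if enough
    tokens, replenished at `rate` bytes/sec, have accumulated."""
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, 0.0

    def allow(self, size, now):
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= size:
            self.tokens -= size
            return True
        return False        # hold the IO in the queue until tokens accumulate

# Q0 gets the leftover capacity C - B1 - B2, as in the last rule above.
C, B1, B2 = 1000, 400, 300                      # made-up bytes/sec figures
q0 = TokenBucket(rate=C - B1 - B2, burst=100)
```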
Example 2: Max-Min Fairness
Policies:
<VM1-VM3,Server X> -> 900 Mbps
Demand:
VM1 -> 600 Mbps
VM2 -> 400 Mbps
VM3 -> 200 Mbps
Result:
VM1 -> 350 Mbps
VM2 -> 350 Mbps
VM3 -> 200 Mbps
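The result follows from progressive filling, the standard max-min fairness procedure; a sketch that reproduces the slide's numbers:

```python
def max_min_fair(capacity, demands):
    # Repeatedly split the remaining capacity equally among unsatisfied
    # flows; flows demanding less than their share are capped at demand.
    alloc = {vm: 0.0 for vm in demands}
    active, remaining = set(demands), float(capacity)
    while active:
        share = remaining / len(active)
        satisfied = {vm for vm in active if demands[vm] <= share}
        if not satisfied:                   # everyone wants >= the fair share
            for vm in active:
                alloc[vm] = share
            break
        for vm in satisfied:
            alloc[vm] = float(demands[vm])
            remaining -= demands[vm]
        active -= satisfied
    return alloc

print(max_min_fair(900, {"VM1": 600, "VM2": 400, "VM3": 200}))
# -> {'VM1': 350.0, 'VM2': 350.0, 'VM3': 200.0}
```

VM3's demand of 200 is below the initial fair share of 300, so it is capped at 200 and the remaining 700 Mbps is split evenly between VM1 and VM2.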
IOFlow: Evaluation of Policy
Enforcement
Windows-based IO stack
10 hypervisors with 12 VMs each (120 VMs total)
4 tenants using 30 VMs each (3 VMs per hypervisor for each tenant)
1 Storage Server
6.4 Gbps IO Bandwidth
1 Controller
1s interval between dynamic enforcements of policies
IOFlow: Evaluation of Policy
Enforcement
Tenant    Policy
Index     {VM 1-30, X}   -> Min 800 Mbps
Data      {VM 31-60, X}  -> Min 800 Mbps
Message   {VM 61-90, X}  -> Min 2500 Mbps
Log       {VM 91-120, X} -> Min 1500 Mbps
IOFlow: Evaluation of Policy Enforcement
IOFlow: Evaluation of Overhead
IOFlow: Conclusions
Contributions
First Software Defined Storage approach
Fine-grain control over the IO operations in Cloud
Limitations
Network or other resources might be the bottleneck
Need to care about locating the VMs (spatial locality) close to data
Flat Datacenter Storage [Nightingale, 2012] provides solutions for this problem
Guaranteed latencies are not expressed by current policies
Best effort approach by setting priority
Specialized Storage Architectures
HDFS [Shvachko, 2009] and GFS [Ghemawat, 2003] work well for Hadoop
MapReduce applications.
Facebook's Photo Storage [Beaver, 2010] exploits workload characteristics to
design and implement a better storage system.