Storage Fabric · CS6453 · 2017. 5. 7.
TRANSCRIPT
Summary
Last week: NVRAM is going to change the way we think about storage.
Today: Challenges of storage layers (SSDs, HDs) built to handle massive data:
Slowdowns in HDs and SSDs.
Enforcing policies for IO operations in Cloud architectures.
Background: Storage for Big Data
One disk is not enough to handle massive amounts of data.
Last time: Efficient datacenter networks using large number of cheap
commodity switches.
Solution: Efficient IO performance using large number of commodity storage
devices.
Background: RAID 0
Achieves Nx performance where
N is the number of Disks.
Is this for free?
When N becomes large, the probability of a Disk failure becomes large as well.
RAID 0 does not tolerate failures.
Background: RAID 1
Achieves (K-1)-fault tolerance
with Kx Disks.
Is this for free?
There are Kx more disks (e.g. to tolerate 1 failure you need 2x more Disks than RAID 0).
RAID 1 does not utilize resources in an efficient way.
Background: Erasure Code
Achieves K-fault tolerance with
N+K Disks.
Efficient utilization of Disks (not
as great as RAID 0).
Fault-Tolerance (not as great as
RAID 1).
Is this for free?
Reconstruction Cost: the # of Disks that must be read to recover data in case of failure(s).
RAID 6 has a Reconstruction Cost of 3.
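The reconstruction-cost idea can be made concrete with a single-parity scheme (a RAID-4/5 style sketch, not code from the lecture): one XOR parity block over N data blocks tolerates a single failure, and rebuilding a lost block requires reading all N surviving blocks.

```python
from functools import reduce

def parity(blocks):
    # XOR corresponding bytes of all blocks to form the parity block.
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

def reconstruct(survivors):
    # XOR of the surviving blocks (data + parity) yields the lost block;
    # reading every other disk is exactly the "Reconstruction Cost".
    return parity(survivors)

data = [b"\x01\x02", b"\x10\x20", b"\xff\x00"]  # 3 data disks (toy blocks)
p = parity(data)                                # 1 parity disk
rebuilt = reconstruct([data[0], data[2], p])    # suppose disk 1 failed
assert rebuilt == data[1]
```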
Modern Erasure Code Techniques
Erasure Coding in Windows Azure Storage [Huang, 2012]
Exploit Point:
Prob[1 failure] ≫ Prob[2 failures or more]
Solution: Construct Erasure Code Technique that has low reconstruction cost for 1
failure.
1.33x more storage overhead (relatively low).
Tolerate up to 3 failures in 16 storage devices.
Reconstruction cost of 6 for 1 failure and 12 for 2+ failures.
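A rough sketch of the local-group idea behind these numbers (hypothetical layout: the real Azure LRC(12,2,2) also keeps two Reed-Solomon global parities, which are what raise tolerance to 3 failures and are omitted here): 12 data blocks split into two local groups of 6, each with its own XOR local parity, so a single failed block is rebuilt from only 6 reads instead of 12.

```python
from functools import reduce

def xor_blocks(blocks):
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*blocks))

# 12 data blocks in two local groups of 6, one XOR local parity per group.
data = [bytes([i, (i * 7) % 256]) for i in range(12)]
groups = [data[:6], data[6:]]
local_parity = [xor_blocks(g) for g in groups]

# Single failure: rebuild block 3 from its local group alone.
lost = 3
survivors = [b for i, b in enumerate(groups[0]) if i != lost] + [local_parity[0]]
assert len(survivors) == 6                  # reconstruction cost 6, not 12
assert xor_blocks(survivors) == data[lost]
```

With the two global parities counted in, storage is (12 + 2 + 2) / 12 ≈ 1.33x, matching the overhead quoted above.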
The Tail at Store: Problem
We have seen how we treat failures with reconstruction. What about
slowdowns in HDs (or SSDs)?
A slowdown of a disk (with no failures) might have a significant impact on overall performance.
Questions:
Do HDs or SSDs exhibit transient slowdowns?
Are slowdowns of disks frequent enough to affect the overall performance?
What causes slowdowns?
How do we deal with slowdowns?
The Tail at Store: Study
Study dataset:

                        Disk         SSD
#RAID groups            38,029       572
#Data drives per group  3-26         3-22
#Data drives            458,482      4,069
Total drive hours       857,183,442  7,481,055
Total RAID hours        72,046,373   1,072,690

RAID groups consist of data drives (D) plus parity drives (P, Q).

[Figure: CDF of per-drive-hour slowdown (Disk), 1x to 8x.]
The Tail at Store: Slowdowns?
Hourly average I/O latency per drive: L
Slowdown: S = L / L_median, where L_median is the median over the drives of the same RAID group
Tail: T = max_i S_i
Slow Disks: S ≥ 2
Disk: S ≥ 2 at the 99.8th percentile; S ≥ 1.5 at the 99.3rd percentile
SSD: S ≥ 2 at the 97.8th percentile; S ≥ 1.5 at the 95.2nd percentile
SSDs exhibit even more slowdowns
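The slowdown metric is easy to compute from per-drive hourly latencies; a minimal sketch with invented numbers:

```python
import statistics

def slowdowns(hourly_latency):
    # S_i = L_i / L_median, with the median taken across the drives
    # of the same RAID group in the same hour.
    med = statistics.median(hourly_latency)
    return [l / med for l in hourly_latency]

lat = [4.0, 4.2, 3.8, 4.1, 12.0]                # ms; one drive is slow this hour
s = slowdowns(lat)
tail = max(s)                                   # T = max_i S_i (about 2.93 here)
slow = [i for i, v in enumerate(s) if v >= 2]   # drives counted as "slow"
```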
[Figure: CDF of slowdown interval length, 1 to 256 hours (Disk vs. SSD).]
The Tail at Store: Duration?
Slowdowns are transient.
40% of HD slowdowns last ≥ 2 hours.
12% of HD slowdowns last ≥ 10 hours.
Many slowdowns happen in consecutive hours (and thus last longer).
[Figure: CDF of inter-arrival period between slowdowns, 0 to 35 hours (Disk vs. SSD).]
The Tail at Store: Correlation between slowdowns in the same storage?
90% of Disk slowdowns are within 24 hours of another slowdown of the same Disk.
> 80% of SSD slowdowns are within 24 hours of another slowdown of the same SSD.
Slowdowns of the same Disk cluster relatively close to each other in time.
[Figure: CDF of rate imbalance (RI) within hours where S_i ≥ 2 (Disk vs. SSD).]
The Tail at Store: Causes?
RI = (I/O rate) / (I/O rate)_median
Rate imbalance does not seem to be the main cause of slowdowns for slow Disks.
[Figure: CDF of size imbalance (ZI) within hours where S_i ≥ 2 (Disk vs. SSD).]
The Tail at Store: Causes?
ZI = (I/O size) / (I/O size)_median
Size imbalance does not seem to be the main cause of slowdowns for slow Disks.
[Figure: CDF of slowdown vs. drive age (Disk), 1x to 5x, one curve per year of age.]
The Tail at Store: Causes?
Disk age shows some correlation with slowdowns, but the correlation is not strong.
The Tail at Store: Causes?
No correlation of slowdowns to the time of day (0:00 to 24:00)
No explicit drive events around slow hours
Unplugging disks and plugging them back does not particularly help
SSD vendors have significant differences between them
The Tail at Store: Solutions
Create Tail-Tolerant RAIDS.
Treat slow disks as failed disks.
Reactive
Detect slow Disks: reads that take much longer than expected (> 2x other Disks).
If a Disk is slow, reconstruct the answer from the other disks using RAID redundancy.
In the best case latency is around 3x that of a read from an average Disk (timeout plus reconstruction read).
Proactive
Always use RAID redundancy for additional read.
Take fastest answer.
Uses much more I/O bandwidth.
Adaptive
Combination of both approaches taking into account the findings.
Use reactive approach until a slowdown is detected.
After this use proactive approach since slowdowns are repetitive and last many hours.
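The three strategies can be sketched as a single read path (all callables and the timeout are hypothetical placeholders, not lecture code): proactive hedging once a drive is marked slow, otherwise a reactive timeout before paying for the redundant read.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait

def adaptive_read(read_block, reconstruct_from_peers, marked_slow, timeout_s):
    with ThreadPoolExecutor(max_workers=2) as pool:
        primary = pool.submit(read_block)
        if marked_slow:
            # Proactive: issue the redundant RAID read immediately,
            # return whichever answer arrives first.
            backup = pool.submit(reconstruct_from_peers)
            done, _ = wait([primary, backup], return_when=FIRST_COMPLETED)
            return next(iter(done)).result()
        # Reactive: give the drive timeout_s (e.g. ~2x average latency),
        # then fall back to reconstruction, ~3x latency in the worst case.
        done, _ = wait([primary], timeout=timeout_s)
        return primary.result() if done else reconstruct_from_peers()
```

The proactive branch burns extra I/O bandwidth on every request, which is why the adaptive scheme only enables it after a slowdown has been observed.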
The Tail at Store: Conclusions
More research on possible causes for Disk and SSD slowdowns is required
Need Tail-Tolerant RAIDS to reduce the overhead from slowdowns
Since reconstruction of data is the way to deal with slowdowns and if
Prob[1 slowdown] ≫ Prob[2 slowdowns or more]
the Azure paper [Huang, 2012] becomes more relevant.
Background: Cloud Storage
General Purpose Applications
Separate VM-VM connections from VM-Storage connections
Storage is virtualized
Many layers from application to actual storage
Resources are shared across multiple tenants
IOFlow: Problem
Cannot support end-to-end policies (e.g.
minimum IO bandwidth from application to
storage)
Applications do not have any way of
expressing their storage policies
Shared infrastructure, where aggressive applications tend to get more IO bandwidth
IOFlow: Challenges
No existing enforcing mechanism for
controlling IO rates
Aggregate performance policies
Non-performance policies
Admission control
Dynamic enforcement
Support for unmodified applications and VMs
IOFlow: Do it like SDNs
IOFlow: Supported policies
<VM, Destination> -> Bandwidth (static, compute side)
<VM, Destination> -> Min Bandwidth (dynamic, compute side)
<VM, Destination> -> Sanitize (static, compute or storage side)
<VM, Destination> -> Priority Level (static, compute and storage side)
<Set of VMs, Set of Destinations> -> Bandwidth (dynamic, compute side)
Example 1: Interface
Policies:
<VM1,Server X> -> B1
<VM2,Server X> -> B2
Controller to SMBc of physical server containing VM1 and VM2
createQueueRule(<VM1,Server X>,Q1)
createQueueRule(<VM2,Server X>,Q2)
createQueueRule(<*,*>,Q0)
configureQueueService(Q1, <B1, low, S>), where S is the size of the queue
configureQueueService(Q2, <B2, low, S>)
configureQueueService(Q0, <C-B1-B2, low, S>), where C is the Capacity of Server X.
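The bandwidth part of configureQueueService can be approximated with a token bucket; a sketch with made-up numbers for C, B1, and B2 (IOFlow's actual enforcement inside the IO stack is more involved):

```python
class TokenBucket:
    """Queue rate limit: an IO of `size` bytes may proceed only if enough
    tokens, replenished at `rate` bytes/sec, have accumulated."""
    def __init__(self, rate, burst):
        self.rate, self.burst = rate, burst
        self.tokens, self.last = burst, 0.0

    def allow(self, size, now):
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst, self.tokens + (now - self.last) * self.rate)
        self.last = now
        if self.tokens >= size:
            self.tokens -= size
            return True
        return False        # hold the IO in the queue until tokens accumulate

# Q0 gets the leftover capacity C - B1 - B2, as in the last rule above.
C, B1, B2 = 1000, 400, 300                      # made-up bytes/sec figures
q0 = TokenBucket(rate=C - B1 - B2, burst=100)
```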
Example 2: Max-Min Fairness
Policies:
<VM1-VM3,Server X> -> 900 Mbps
Demand:
VM1 -> 600 Mbps
VM2 -> 400 Mbps
VM3 -> 200 Mbps
Result:
VM1 -> 350 Mbps
VM2 -> 350 Mbps
VM3 -> 200 Mbps
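The result follows from progressive filling, the standard max-min fairness procedure; a sketch that reproduces the slide's numbers:

```python
def max_min_fair(capacity, demands):
    # Repeatedly split the remaining capacity equally among unsatisfied
    # flows; flows demanding less than their share are capped at demand.
    alloc = {vm: 0.0 for vm in demands}
    active, remaining = set(demands), float(capacity)
    while active:
        share = remaining / len(active)
        satisfied = {vm for vm in active if demands[vm] <= share}
        if not satisfied:                   # everyone wants >= the fair share
            for vm in active:
                alloc[vm] = share
            break
        for vm in satisfied:
            alloc[vm] = float(demands[vm])
            remaining -= demands[vm]
        active -= satisfied
    return alloc

print(max_min_fair(900, {"VM1": 600, "VM2": 400, "VM3": 200}))
# -> {'VM1': 350.0, 'VM2': 350.0, 'VM3': 200.0}
```

VM3's demand of 200 is below the initial fair share of 300, so it is capped at 200 and the remaining 700 Mbps is split evenly between VM1 and VM2.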
IOFlow: Evaluation of Policy
Enforcement
Windows-based IO stack
10 hypervisors with 12 VMs each (120 VMs total)
4 tenants using 30 VMs each (3 VMs per hypervisor for each tenant)
1 Storage Server
6.4 Gbps IO Bandwidth
1 Controller
1s interval between dynamic enforcements of policies
IOFlow: Evaluation of Policy
Enforcement
Tenant    Policy
Index     {VM 1-30, X}   -> Min 800 Mbps
Data      {VM 31-60, X}  -> Min 800 Mbps
Message   {VM 61-90, X}  -> Min 2500 Mbps
Log       {VM 91-120, X} -> Min 1500 Mbps
IOFlow: Evaluation of Policy Enforcement
IOFlow: Evaluation of Overhead
IOFlow: Conclusions
Contributions
First Software Defined Storage approach
Fine-grain control over the IO operations in Cloud
Limitations
Network or other resources might be the bottleneck
Need to care about locating the VMs (spatial locality) close to data
Flat Datacenter Storage [Nightingale, 2012] provides solutions for this problem
Guaranteed latencies are not expressed by current policies
Best effort approach by setting priority
Specialized Storage Architectures
HDFS [Shvachko, 2009] and GFS [Ghemawat, 2003] work well for Hadoop
MapReduce applications.
Facebook's Photo Storage [Beaver, 2010] exploits workload characteristics to
design and implement a better storage system.