optimize storage efficiency & performance with erasure coding

37
Dror Goldenberg VP Software Architecture Mellanox Technologies Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload

Upload: vominh

Post on 10-Dec-2016

257 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Optimize Storage Efficiency & Performance with Erasure Coding

Dror Goldenberg

VP Software Architecture Mellanox Technologies

Optimize Storage Efficiency & Performance with Erasure Coding

Hardware Offload

Page 2: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved.

SNIA Legal Notice

The material contained in this tutorial is copyrighted by the SNIA unless otherwise noted. Member companies and individual members may use this material in presentations and literature under the following conditions:

Any slide or slides used must be reproduced in their entirety without modification The SNIA must be acknowledged as the source of any material used in the body of any document containing material from these presentations.

This presentation is a project of the SNIA Education Committee. Neither the author nor the presenter is an attorney and nothing in this presentation is intended to be, or should be construed as legal advice or an opinion of counsel. If you need legal advice or a legal opinion please contact your attorney. The information presented herein represents the author's personal opinion and current understanding of the relevant issues involved. The author, the presenter, and the SNIA do not assume any responsibility or liability for damages arising out of any reliance on or use of this information. NO WARRANTIES, EXPRESS OR IMPLIED. USE AT YOUR OWN RISK.

2

Page 3: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved.

Abstract

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload

Nearly all object storage, including Ceph and Swift, support erasure coding because it is a more efficient data protection method than simple replication or traditional RAID. However, erasure coding is very CPU intensive and typically slows down storage performance significantly. Now Ethernet network cards are available that offload erasure coding calculations to hardware for both writing and reconstructing data. This offload technology has the potential to change the storage market by allowing customers to deploy more efficient storage without sacrificing performance. Attend this presentation to learn how erasure coding hardware offloads work and how they can integrate with products such as Ceph.

Learning Objectives Learn the benefits and costs of erasure coding Understand how erasure coding works in products such as Ceph See how erasure coding hardware offloads accelerate storage performance

3

Page 4: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved.

Software Defined Storage

Software Defined Datacenter Software Defined Networks Software Defined Storage

Scale Out Advantage

Scale-out Storage Examples

Ceph, Swift, Gluster and many more

4

Compute

Network

Storage

Page 5: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved.

D3 D0 D0

Ensuring Data Availability

Data Replication Erasure Coding

D0 D0 D0 D0 P0

D

D1

D2

D3

D1 D1

D2 D2

D3 D3

D1

D2

D3

P1

P2

P3

5

Page 6: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved.

Ensuring Data Availability

Data Replication Erasure Coding

D0 D0 D0 D0 P0

D1

D2

D3

D1 D1

D2 D2

D3 D3

D1

D2

D3

P1

P2

P3

6

Page 7: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved.

Ensuring Data Availability - Summary

Data Replication Capacity: 3x data typical Resilient to 2 failures

Erasure Coding Capacity: 1.4x data typical Better failure resilience

e.g. 10+4: 1.4x capacity, 4 failures CPU & network hungry Longer & data intensive rebuild Partial update requires read

7

D D D D

D D D D

D D D D

D0

D0

D0

D0

D1

D1

D1

D1

D2

D2

D2

D2

D3

D3

D3

D3

P0

P0

P0

P0

P1

P1

P1

P1

Page 8: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved.

The Value of Erasure Coding Offload

System cost Reduce overall system cost – use cheaper CPU

Enable hyperconverged systems Efficient compute - reduce overall storage overhead

Enable value add storage disaggregation Lightweight clients can efficiently add data availability

Efficient data recovery/rebuild Systems spend long time in degraded state during rebuild Ensure rebuild has minimal impact on operational characteristics

Network efficiency with EC calculation and I/O Avoid extra interrupts and context switches

Better parallelism

8

Page 9: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved.

Erasure Codes Theory

K data + M parity = N total Tolerates up to M failures Total overhead N/K

Systematic Codes – encoded data contains original data Maximal Distance Separable (MDS)

Can survive any M erasures

Reed Solomon Coding – this session focus

Many other codes exist: RAID 6, XOR, Pyramid, LRC, … 9

D D D D D P P P

K M

Page 10: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved.

Reed Solomon Encoding

Generating parity is matrix multiplication

10

1 D0 D01 D1 D1

1 D2 D21 D3 D3

1 D4 D41 D5 D5

1 D6 D61 D7 D7

1 D8 D81 D9 D9

B11 B12 B13 B14 B15 B16 B17 B18 B19 B110 P0B21 B22 B23 B24 B25 B26 B27 B28 B29 B210 P1B31 B32 B33 B34 B35 B36 B37 B38 B39 B310 P2B41 B42 B43 B44 B45 B46 B47 B48 B49 B410 P3

* =

B * D = {D,P}

k

k

m

Page 11: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved.

Reed Solomon Decoding (1/2)

In case of failures Matrix is reduced Calculate Inverse matrix Recover Data Recover Parity

11

1 D0 D01 D1 D1

1 D2 D21 D3 D3

1 D4 D41 D5 D5

1 D6 D61 D7 D7

1 D8 D81 D9 D9

B11 B12 B13 B14 B15 B16 B17 B18 B19 B110 P0B21 B22 B23 B24 B25 B26 B27 B28 B29 B210 P1B31 B32 B33 B34 B35 B36 B37 B38 B39 B310 P2B41 B42 B43 B44 B45 B46 B47 B48 B49 B410 P3

* =

B' * D = {D,P}'

k

k

m

Page 12: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved.

Reed Solomon Decoding (2/2)

Recovery - how? B’ * D = S B’-1 * B’ * D = B’ -1 * S D=B’ -1 * S {D,P}=B * D

S = Survivors vector

12

1 D0 D01 D1 D1

1 D2 D21 D3 D4

1 D4 D51 D5 D7

1 D6 D8B11 B12 B13 B14 B15 B16 B17 B18 B19 B110 D7 P0B21 B22 B23 B24 B25 B26 B27 B28 B29 B210 D8 P1B41 B42 B43 B44 B45 B46 B47 B48 B49 B410 D9 P3

* =

B' * D = {D,P}'= S

Page 13: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved.

Algebra Magic (1/2) – Matrix Generation

All submatrices must be invertible Vandermonde matrix is used as baseline Derived matrix through elementary operations

13

1 1 1^2 1^3 1^4 … 1^(k-1) 11 2 2^2 2^3 2^4 … 2^(k-1) 11 3 3^2 3^3 … 11 4 4^2 … 1. . 1. . 1. . 1. . 1

11 k-1 11 k B11 B12 B13 B14 B15 B16 B17 B18 B19 B110

1 k+1 B21 B22 B23 B24 B25 B26 B27 B28 B29 B210

… … B31 B32 B33 B34 B35 B36 B37 B38 B39 B310

1 n-1 (n-1)^2 (n-1)^3 … B41 B42 B43 B44 B45 B46 B47 B48 B49 B410

Vandermonde Matrix Derived Matrix

k

k

m

k

k

m

Page 14: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved.

Algebra Magic (2/2) – Arithmetic OPs

Needed: finite field, multiplicative inverse Operation done over Galois Field – GF(2w)

Sum is XOR operation Multiplication is more complicated…

Numbers are multiplied and then divided by an irreducible polynomial

n must be ≤ 2w

14

Page 15: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved.

High CPU Demand

Matrix multiplication compute needed O(k*m) multiply-add operations

Cache & TLB intensive (large data sets)

15

Page 16: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved.

Erasure Code History

Évariste Galois 1811-1832 Alexandre-Théophile Vandermonde 1735 – 1796 Irving S. Reed 1923-2012 Gustave Solomon 1930-1996

16

http://selunec.deviantart.com/art/Evariste-Galois-189586719

http://www.giancarlodurso.altervista.org/blog/ interpolazione-polinomiale-con-matrice-di-vandermonde/

http://reedsolomon.tripod.com/

Page 17: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved.

Network Traffic – Sunny Day Scenario (Replication)

17

Client OSD

Write

Write Ack

OSD OSD

Replications

N Write Operation

Client OSD

Read

Read Reply

OSD OSD

N READ Operation

Typically 2x cluster network traffic (N-1)∙x

No extra cluster network traffic

Page 18: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved.

Network Traffic – Sunny Day Scenario (Erasure Coding)

18

Client OSD

Read

Read Reply

K

OSD OSD OSD OSD OSD

M

Read Shards

decode

READ Operation

Client OSD

Write

Write Ack

K

OSD OSD OSD OSD OSD

M

Write Shards

encode

Write Operation

~1x cluster network traffic ((k-1)/k)∙x

Typically ~1.4x cluster network traffic ((k+m-1)/k) ∙x

Page 19: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved.

Network Traffic – Recovery (Replication)

19

Example - Time to recover Net networking time to move data 20TB system @40GE 1.1hrs 200TB system @40GE 11.1hrs

Similar flows for scrubbing

Client OSD OSD OSD OSD

Read

Read Reply

Recovery backend traffic 1x times lost data

Page 20: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved.

Network Traffic – Recovery (Erasure Coding)

20

Example - Time to recover (10+4) Net networking time to move data 20TB system @40GE 14.4hrs 200TB system @40GE 144.4hrs

Similar flows for scrubbing

Client OSD

K

OSD OSD OSD OSD OSD

M

Read Shards

OSD

decode

Recovery backend traffic typically ~14x times lost data (K+M-1)∙x

Tradeoff: recovery time

vs storage efficiency

Page 21: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved.

Erasure Coding Offload Primitives

Initialization Encode Decode Update Send

21

Page 22: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved.

Typical Workflow Long Write (Encode)

22

write(*data)

Data stripe

Encode

D D D D D

D D D D D P P P

Send

Calculation B*D=P

Long Write D

Page 23: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved.

Typical Workflow Reconstruct (Decode)

23

decode(*data)

Retrieve Available Elements

Decode (recover lost data)

D D D D D P P P

Send

Calculation D=B’-1*Survivors

D D D D P P D P

D D D D D Calculation

B*D=P Encode (calculate parity)

Page 24: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved.

Typical Workflow Short Write (Update)

24

write(*data)

Data stripe Retrieve parity and modified elements

Encode

D D

D’ D’ P’ P’ P’

Send

Calculation B*(D+D’)+P=P’

Short Write D’

P P P

D’ D’

Page 25: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved.

Synchronous vs Asynchronous

Synchronous APIs block until operation completes Suitable to the current common API semantics CPU computes – runs to completion Onload operation – faster ISA, but still 100% CPU utilization

Asynchronous enables CPU to focus on computation Suitable to offload semantics Operation starts, completion reported upon callback Can implement Synchronous using Asynchronous calls

Easy fit for today’s integrations

25

Page 26: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved.

Onload vs Offload

Onload Computation all done on CPU CPU at 100% during calculation Cache/TLB pollution Example: ISA-L

Offload Computation all done in accelerator CPU at 0% during calculation Cache/TLB unaffected Example: ec_offload APIs

26

Page 27: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved.

Networking Adapter and Verbs - Briefing

HCA – Host Card Adapter Asynchronous interface

Consumer posts work requests HCA processes Consumer polls completions

I/O channel exposed usermode apps Transport services

Reliable / Unreliable Connected / Datagram Send/Receive, RDMA, Atomic operations Data calculations

Offloading Transport executed by HCA Kernel bypass RDMA

27

Port

VL VL VL VL …

Port

VL VL VL VL …

Transport and RDMA Offload Engine

Send Queue

Receive Queue

QP

Send Queue

Receive Queue

QP

Consumer

Completion Queue

posting WQEs

polling CQEs RDMA Network Adapter

Page 28: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved.

EC Offload APIs Cheatsheet

Synchronous Asynchronous

Initialization ibv_exp_alloc_ec_calc() ibv_exp_dealloc_ec_calc()

-

Encode ibv_exp_ec_encode_sync() ibv_exp_ec_encode_async()

Decode ibv_exp_ec_decode_sync() ibv_exp_ec_decode_async()

Update ibv_exp_ec_update_sync() ibv_exp_ec_update_async()

Encode & Send - ibv_exp_ec_encode_send()

Bookeeping - ibv_exp_ec_poll()

28

Page 29: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved.

Encoding Performance (Single Core)

29

3.7x Faster

24x Lower C

PU%

Preliminary Numbers Optimizations ongoing

Page 30: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved.

Integration into Storage Platforms

Currently Work in Progress Leverage community work

30

Storage Platform Ceph, HDFS, etc.

Erasure Code Subsystem

Plugin APIs, e.g. jerasure

C/Java Code

ISA-L EC Offload

Initial Integration (Synchronous)

Storage Platform Ceph, HDFS, etc.

Erasure Code Subsystem

Plugin APIs, e.g. jerasure

C/Java Code

ISA-L EC Offload +

Networking

Deeper Integration (Async, Send)

Network Subsytem

Page 31: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved.

EC Offload APIs - Current Status

Opensource BSD/GPL license Supported on ConnectX-4 & ConnectX-4 LX Code available today (MLNX_OFED 3.3 and up)

Looking at integrations into opensource and commercial applications

Ceph, HDFS, …

31

Page 32: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved.

Summary

Software Defined Storage drives scale out storage Erasure Codes enable data availability at lower capacity

Tradeoff: CPU & Network intensive, complexity

Erasure Codes offload offers Better performance Lower cost

Library available today Integration to storage systems underway

Challenges Efficient use of asynchronous acceleration Combining Erasure Codes offload and networking

32

Page 33: Optimize Storage Efficiency & Performance with Erasure Coding

Thank You !

Page 34: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved.

Attribution & Feedback

34

Please send any questions or comments regarding this SNIA Tutorial to [email protected]

The SNIA Education Committee thanks the following Individuals for their contributions to this Tutorial.

Authorship History 09/2016 Dror Goldenberg

Additional Contributors Joseph L White Marty Foltyn

Page 35: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved. 35

Resources – Networking & Erasure Codes

Erasure codes http://web.eecs.utk.edu/~plank/plank/papers/FAST-2005.pdf Swift Object Storage: Adding Erasure Codes - SNIA Tutorial http://www.snia.org/sites/default/files/Luse_Kevin_SNIATutorialSwift_Object_Storage2014_final.pdf Linux RDMA mailing list http://www.mail-archive.com/[email protected]/

InfiniBand® Trade Association http://www.infinibandta.org/home

Page 36: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved.

API - Details

ibv_exp_alloc_ec_calc(*pd, *attr) ibv_exp_dealloc_ec_calc(*calc) ibv_exp_ec_encode_async(*calc, *ec_mem, *ec_comp) ibv_exp_ec_encode_sync(*calc, *ec_mem) ibv_exp_ec_decode_async(*calc, *ec_mem, *erasures, *decode_matrix, *ec_comp) ibv_exp_ec_decode_sync(*calc,*ec_mem,*erasures,*decode_matrix) ibv_exp_ec_update_async(*calc,*ec_mem,*data_updates,*code_updates,*ec_comp) ibv_exp_ec_update_sync(*calc,*ec_mem,*data_updates,*code_updates) ibv_exp_ec_poll(*calc, n) ibv_exp_ec_encode_send(*calc,*ec_mem,*data_stripes,*code_stripes)

36

Page 37: Optimize Storage Efficiency & Performance with Erasure Coding

Optimize Storage Efficiency & Performance with Erasure Coding Hardware Offload Approved SNIA Tutorial © 2016 Storage Networking Industry Association. All Rights Reserved.

API - Initialization

struct ibv_exp_ec_calc_init_attr { uint32_t comp_mask; uint32_t max_inflight_calcs; int k; int m; int w; int max_data_sge; int max_code_sge; uint8_t *encode_matrix; int affinity_hint; int polling; };

37