Design & Implementation of LH*RS: A Highly-Available Distributed Data Structure

Rim Moussa
[email protected]
http://ceria.dauphine.fr/rim/rim.html

Thomas J.E. Schwarz
[email protected]
http://www.cse.scu.edu/~tschwarz/homepage/thomas_schwarz.html

Workshop in Distributed Data & Structures, July 2004
2
Objective
Factors of interest are:
- Parity Overhead
- Recovery Performance

LH*RS:
- Design
- Implementation
- Performance Measurements
3
Overview
1. Motivation
2. Highly-available schemes
3. LH*RS
4. Architectural Design
5. Hardware testbed
6. File Creation
7. High Availability
8. Recovery
9. Conclusion
10. Future Work
   - Scenario Description
   - Performance Results
4
Motivation
- Information volume grows by about 30% per year
- Disk access and CPUs are bottlenecks
- Failures are frequent & costly

Business Operation                | Industry       | Average Hourly Financial Impact
Brokerage (retail) operations     | Financial      | $6.45 million
Credit card sales authorization   | Financial      | $2.6 million
Airline reservation centers       | Transportation | $89,500
Cellular (new) service activation | Communication  | $41,000

Source: Contingency Planning Research, 1996
5
Requirements
Need: Highly Available Networked Data Storage Systems
- Scalability
- High Throughput
- High Availability
6
Scalable & Distributed Data Structure (SDDS)
Dynamic file growth
[Figure: clients send inserts over the network to the Data Buckets (DBs). An overloaded DB tells the Coordinator "I'm Overloaded!"; the Coordinator replies "You Split!", and records are transferred to a new bucket.]
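The split rule behind "You Split!" is the standard LH* one; the following is a minimal sketch under that assumption (illustrative names, not the project's code): when the coordinator tells bucket n at level i to split, the bucket rehashes its records with h_(i+1) and ships roughly half of them to the new bucket n + 2^i * N.

```python
# Minimal sketch of an LH* bucket split (N is the assumed number of initial buckets).

N = 1

def h(level, key):
    """Linear-hashing function h_level(key) = key mod (2**level * N)."""
    return key % (2 ** level * N)

def split(bucket_no, level, records):
    """Rehash the bucket's records with h_(level+1); return the kept records,
    the moved records, and the address of the new bucket."""
    new_bucket_no = bucket_no + 2 ** level * N
    kept, moved = [], []
    for key, data in records:
        (kept if h(level + 1, key) == bucket_no else moved).append((key, data))
    return kept, moved, new_bucket_no

# Example: bucket 0 at level 0 splits; keys hashing to 1 move to the new bucket 1.
kept, moved, new_no = split(0, 0, [(7, 'a'), (8, 'b'), (13, 'c')])
print(new_no, kept, moved)   # 1 [(8, 'b')] [(7, 'a'), (13, 'c')]
```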
7
SDDS (Ctnd.)
No centralized directory access
[Figure: a client sends a query over the network to the Data Buckets (DBs). If the client's image is outdated, the query is forwarded to the correct DB, which sends the client an Image Adjustment Message.]
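The forwarding and Image Adjustment Message (IAM) mechanism follows the standard LH* client algorithm; the sketch below illustrates that textbook rule and is not copied from the LH*RS code.

```python
# Sketch of LH* client-side addressing with a possibly outdated image (i', n').

N = 1   # assumed number of initial buckets

def client_address(key, i_prime, n_prime):
    """Bucket address the client computes from its image (i', n')."""
    a = key % (2 ** i_prime * N)
    if a < n_prime:                        # this bucket has already split at level i'
        a = key % (2 ** (i_prime + 1) * N)
    return a

def adjust_image(image, iam_level, iam_bucket):
    """Update the client image from an IAM carrying the level j and address a
    of the bucket that actually served the query (classic LH* update rule)."""
    i_prime, n_prime = image
    if iam_level > i_prime:
        i_prime, n_prime = iam_level - 1, iam_bucket + 1
        if n_prime >= 2 ** i_prime * N:
            i_prime, n_prime = i_prime + 1, 0
    return (i_prime, n_prime)
```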
8
Solutions towards High Availability
Data Replication
(+) Good response time, since mirrors are queried
(-) High storage cost (n times the data for n replicas)

Parity Calculus
Erasure-resilient codes are evaluated with regard to:
- Coding rate (parity volume / data volume)
- Update penalty
- Group size used for data reconstruction
- Complexity of coding & decoding
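As a worked example of the coding rate: with groups of 4 data buckets (the group size used on the testbed later in this talk), k = 1 parity bucket per group gives a rate of 1/4 = 0.25, and k = 2 gives 2/4 = 0.5, versus a rate of 1 or more for replication.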
9
Fault-Tolerant Schemes
1 server failure:
- Simple XOR parity calculus: RAID systems [Patterson et al., 88], the SDDS LH*g [Litwin et al., 96]

More than 1 server failure:
- Binary linear codes [Hellerstein et al., 94]
- Array codes: EVENODD [Blaum et al., 94], X-code [Xu et al., 99], RDP scheme [Corbett et al., 04] (tolerate just 2 failures)
- Reed-Solomon codes: IDA [Rabin, 89], RAID X [White, 91], FEC [Blomer et al., 95], tutorial [Plank, 97], LH*RS [Litwin & Schwarz, 00] (tolerate a large number of failures)
A Highly Available & Distributed Data Structure: LH*RS
[Litwin & Schwarz, 00], [Litwin, Moussa & Schwarz, sub.]
11
LH*RS
SDDS: data distribution scheme based on Linear Hashing, LH*LH [Karlsson et al., 96], applied to the key field
Parity calculus: Reed-Solomon codes [Reed & Solomon, 60]
- Scalability
- High Throughput
- High Availability
12
LH*RS File Structure
[Figure: a group of Data Buckets and its Parity Buckets; records in a group are aligned by insert rank r (ranks 0, 1, 2 shown).]
Data bucket record: Key | Data Field (each record receives an insert rank r in its bucket)
Parity bucket record: Rank | [Key List] | Parity Field
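A minimal sketch of the two record layouts above, with assumed field names (the slide does not show the actual on-the-wire format):

```python
# Illustrative record layouts for LH*RS (field names are assumptions).

from dataclasses import dataclass, field
from typing import List

@dataclass
class DataRecord:
    key: int            # LH* key of the record
    data: bytes         # non-key data field

@dataclass
class ParityRecord:
    rank: int                                        # insert rank r within the bucket group
    keys: List[int] = field(default_factory=list)    # keys of the group's data records at rank r
    parity: bytes = b""                              # parity symbols computed over those records
```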
Architectural Design of LH*RS
14
Communication
Use of TCP/IP (better performance & reliability than UDP for bulk transfers):
- New PB creation
- Large update transfer (DB split)
- Bucket recovery

Use of UDP (speed):
- Individual insert/update/delete/search queries
- Record recovery
- Service and control messages
15
Bucket Architecture
[Figure: a bucket's internal architecture. Network-facing ports (multicast listening port, send and receive UDP ports, TCP/IP port) feed message queues and a process buffer for message processing. Threads: a Multicast Listening thread and a Multicast Working thread; an Ack. Management thread handling the sending-credit window (free zones, messages waiting for acknowledgement, not-yet-acked messages); UDP and TCP Listening threads; and Working threads 1..n.]
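A condensed sketch of the message path in the figure above: one UDP listening thread feeding a message queue that n working threads consume. It is illustrative only (port number and function names are assumptions); the real bucket also runs the TCP and multicast listeners and the acknowledgement-management thread shown in the figure.

```python
import queue
import socket
import threading

MSG_QUEUE = queue.Queue()     # "Message Queue - Message processing -"
NUM_WORKERS = 4               # "Work. Thread 1 .. n"
UDP_PORT = 5000               # assumed port number

def udp_listening_thread():
    """Receive datagrams and hand them to the working threads."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", UDP_PORT))
    while True:
        data, addr = sock.recvfrom(64 * 1024)
        MSG_QUEUE.put((data, addr))

def working_thread():
    """Dequeue messages and process them (insert/search/update/delete, etc.)."""
    while True:
        data, addr = MSG_QUEUE.get()
        process_message(data, addr)
        MSG_QUEUE.task_done()

def process_message(data, addr):
    pass   # placeholder for the bucket's request-handling logic

threads = [threading.Thread(target=udp_listening_thread, daemon=True)]
threads += [threading.Thread(target=working_thread, daemon=True) for _ in range(NUM_WORKERS)]
# for t in threads: t.start()
```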
16
Architectural Design
Enhancements to the SDDS2000 [Bennour, 00], [Diène, 01] bucket architecture:

TCP/IP Connection Handler
- TCP/IP connections use passive OPEN (RFC 793 [ISI, 81]); TCP/IP implementation under the Windows 2000 Server OS [MacDonald & Barkley, 00]

Flow Control and Acknowledgement Management
- Principle of "sending credit + message conservation until delivery" [Jacobson, 88], [Diène, 01]
- Example, recovery of 1 DB: 6.7 s with the SDDS2000 architecture vs. 2.6 s with the new architecture, an improvement of about 60% (hardware: 733 MHz machines, 100 Mbps network)
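A sketch of the "sending credit + message conservation until delivery" principle, as one might implement it (names and structure are assumptions; in the real buckets this state lives in the Ack. Management thread of the architecture above):

```python
class CreditSender:
    """At most `credit` unacknowledged messages may be outstanding; every sent
    message is conserved until its acknowledgement arrives, so it can be resent."""

    def __init__(self, credit=5):                 # e.g. client sending credit 1 or 5
        self.credit = credit
        self.unacked = {}                         # msg_id -> message kept until delivery

    def can_send(self):
        return len(self.unacked) < self.credit

    def send(self, msg_id, message, transmit):
        if not self.can_send():
            return False                          # window full: wait for acknowledgements
        self.unacked[msg_id] = message            # conserve the message
        transmit(message)
        return True

    def on_ack(self, msg_id):
        self.unacked.pop(msg_id, None)            # frees one unit of sending credit

    def on_timeout(self, msg_id, transmit):
        if msg_id in self.unacked:
            transmit(self.unacked[msg_id])        # retransmit the conserved copy
```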
17
Architectural Design (Ctnd.)
Multicast Component
- Before: a pre-defined & static table of IP addresses
- Now: a dynamic IP address structure, updated when new/spare buckets (PBs/DBs) are added, through a multicast probe
[Figure: the Coordinator, the DBs and PBs, a Blank DBs multicast group and a Blank PBs multicast group.]
18
Hardware Testbed
5 machines (Pentium IV 1.8 GHz, 512 MB RAM)
Ethernet network: max bandwidth of 1 Gbps
Operating system: Windows 2000 Server
Tested configuration: 1 client, a group of 4 data buckets, k parity buckets with k ∈ {0, 1, 2}
LH*RS
File Creation
20
File Creation
Client Operation
- Each insert/update/delete of a data record is propagated to the parity buckets.

Data Bucket Split
- Splitting DB, for the records that remain: N deletes (from the old rank) and N inserts (at the new rank) sent to the PBs; for the records that move: N deletes.
- New DB: N inserts (the moved records) sent to its PBs.
- All updates are gathered in the same buffer and transferred (TCP/IP) simultaneously to the respective parity buckets of the splitting DB and the new DB (see the sketch below).
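A sketch of how the split's parity updates could be batched into one buffer per group, as described above (rank bookkeeping is simplified and the names are illustrative):

```python
def build_split_update_buffers(remaining, moved):
    """remaining: [(key, old_rank, new_rank)] records the splitting DB keeps
       moved:     [(key, old_rank)]           records shipped to the new DB
       Returns one update buffer per group, to be sent in a single TCP transfer
       to that group's parity buckets."""
    splitting_group_buffer, new_group_buffer = [], []

    for key, old_rank, new_rank in remaining:          # records that remain
        splitting_group_buffer.append(("delete", old_rank, key))
        splitting_group_buffer.append(("insert", new_rank, key))

    for key, old_rank in moved:                        # records that move away
        splitting_group_buffer.append(("delete", old_rank, key))

    for new_rank, (key, _) in enumerate(moved):        # same records, in the new DB's group
        new_group_buffer.append(("insert", new_rank, key))

    return splitting_group_buffer, new_group_buffer
```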
21
File Creation Performance (Client Sending Credit = 1)
Experimental set-up: file of 25,000 data records; 1 data record = 104 B.

[Chart: file creation time (sec) vs. number of inserted keys (0 to 25,000), for k = 0, 1, 2. Final times: 7.896 s (k = 0), 9.990 s (k = 1), 10.963 s (k = 2).]

PB overhead:
- k = 0 to k = 1: performance degradation of 20%
- k = 1 to k = 2: performance degradation of 8%
22
File Creation Performance (Client Sending Credit = 5)
Experimental set-up: file of 25,000 data records; 1 data record = 104 B.

[Chart: file creation time (sec) vs. number of inserted keys (0 to 25,000), for k = 0, 1, 2. Final times: 4.349 s (k = 0), 6.940 s (k = 1), 7.720 s (k = 2).]

PB overhead:
- k = 0 to k = 1: performance degradation of 37%
- k = 1 to k = 2: performance degradation of 10%
LH*RS
Parity Bucket Creation
24
PB Creation Scenario
Searching for a new PB
[Figure: the Coordinator multicasts "Wanna join group g?" [Sender IP@ + Entity#, Your Entity#] to the PBs connected to the Blank PBs multicast group.]
25
PB Creation Scenario
Waiting for replies
[Figure: each candidate PB answers "I would", starts UDP and TCP listening and its working threads, and waits for confirmation; if the time-out elapses, it cancels all. The Coordinator waits for the replies.]
26
PB Creation Scenario
PB selection
[Figure: the Coordinator sends "You are Selected" <UDP> to the chosen PB, which disconnects from the Blank PBs multicast group; the other candidates receive a Cancellation.]
27
PB Creation Scenario
Auto-creation: query phase
[Figure: the new PB sends "Send me your contents!" <UDP> to each bucket of the data bucket group.]
28
PB Creation Scenario
Auto-creation: encoding phase
[Figure: each data bucket returns the requested buffer <TCP> to the new PB, which processes the buffers (encoding).]
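Putting the preceding slides together, the parity-bucket creation protocol can be summarized by the sketch below. Message names follow the slides; the transport layer is replaced by caller-supplied callbacks, so everything else (function names, data layout) is an illustrative assumption rather than the project's code.

```python
def multicast_probe(blank_pbs, group_g):
    """'Wanna join group g?' sent to the blank-PBs multicast group; returns the
    candidates that answered 'I would'."""
    return [pb for pb in blank_pbs if pb.get("available")]

def create_parity_bucket(blank_pbs, group_g, data_buckets, udp_send, tcp_fetch, encode):
    replies = multicast_probe(blank_pbs, group_g)
    if not replies:                          # time-out elapsed with no volunteer
        return None                          # cancel all

    new_pb, *others = replies                # PB selection
    udp_send(new_pb, "You are Selected")
    for pb in others:
        udp_send(pb, "Cancellation")
    new_pb["blank"] = False                  # disconnect from the blank-PBs multicast group

    # Auto-creation: query phase (UDP requests), then encoding phase (TCP buffers).
    for db in data_buckets:
        buffer = tcp_fetch(db)               # "Send me your contents!" -> requested buffer
        encode(new_pb, buffer)               # XOR or RS encoding of the received buffer
    return new_pb
```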
29
PB Creation Performance: XOR Encoding
Experimental set-up: bucket size from 5,000 to 50,000 records; bucket contents = 0.625 * bucket size; file size = 2.5 * bucket size records.

Bucket Size | Total Time (sec) | Processing Time (sec) | Communication Time (sec)
 5,000      | 0.190            | 0.140                 | 0.029
10,000      | 0.429            | 0.304                 | 0.066
25,000      | 1.007            | 0.738                 | 0.144
50,000      | 2.062            | 1.484                 | 0.322

For all bucket sizes, processing time is about 74% of total time.
Encoding rate (MB/sec): 0.659, 0.640, 0.686, 0.608.
30
PB Creation Performance: RS Encoding
Experimental set-up: bucket size from 5,000 to 50,000 records; bucket contents = 0.625 * bucket size; file size = 2.5 * bucket size records.

Bucket Size | Total Time (sec) | Processing Time (sec) | Communication Time (sec)
 5,000      | 0.193            | 0.149                 | 0.035
10,000      | 0.446            | 0.328                 | 0.059
25,000      | 1.053            | 0.766                 | 0.153
50,000      | 2.103            | 1.531                 | 0.322

For all bucket sizes, processing time is about 74% of total time.
Encoding rate (MB/sec): 0.673, 0.674, 0.713, 0.618.
31
PB Creation Performance: Comparison
- XOR encoding rate: 0.66 MB/sec; RS encoding rate: 0.673 MB/sec
- For bucket size = 50,000, XOR provides a performance gain of 5% in processing time (0.02% in the total time).
LH*RS
Bucket Recovery
33
Buckets' Recovery: Failure Detection
[Figure: the Coordinator sends "Are You Alive?" <UDP> to the data buckets and parity buckets.]
34
Buckets' Recovery: Waiting for Replies
[Figure: the alive data buckets and parity buckets answer "I am Alive" <UDP>; the Coordinator waits for the replies.]
35
Buckets' Recovery: Searching for 2 Spare DBs
[Figure: the Coordinator multicasts "Wanna be a Spare DB?" [Sender IP@, Your Entity#] to the DBs connected to the Blank DBs multicast group.]
36
Buckets' Recovery: Waiting for Replies
[Figure: each candidate spare DB answers "I would", starts UDP and TCP listening and its working threads, and waits for confirmation; if the time-out elapses, it cancels all. The Coordinator waits for the replies.]
37
Buckets' Recovery: Spare DBs Selection
[Figure: the Coordinator sends "You are Selected" <UDP> to the chosen spare DBs, which disconnect from the Blank DBs multicast group; the other candidates receive a Cancellation.]
38
Buckets' Recovery: Recovery Manager Determination
[Figure: the Coordinator sends "Recover Buckets" [Spares IP@s + Entity#s; ...] to a parity bucket, which becomes the recovery manager.]
39
Buckets' Recovery: Query Phase
[Figure: the Recovery Manager sends "Send me Records of rank in [r, r+slice-1]" <UDP> to the alive data and parity buckets participating in the recovery; the spare DBs stand by.]
40
Buckets' Recovery: Reconstruction Phase
[Figure: the alive buckets return the requested buffers <TCP> to the Recovery Manager, which runs the decoding process and sends the recovered records <TCP> to the spare DBs.]
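The query and reconstruction phases amount to a slice-by-slice loop run by the recovery manager. The sketch below abstracts the decoding and the transport behind caller-supplied functions; it illustrates the control flow under those assumptions and is not the project's code.

```python
def recover_buckets(alive_buckets, spare_dbs, bucket_size, slice_size,
                    fetch, decode, ship):
    """fetch(bucket, r, n): records of ranks r .. r+n-1 from an alive bucket (UDP query)
       decode(columns):     rebuild the failed buckets' records for those ranks
       ship(spare, recs):   send recovered records to a spare DB (TCP)"""
    for r in range(0, bucket_size, slice_size):
        n = min(slice_size, bucket_size - r)
        columns = [fetch(b, r, n) for b in alive_buckets]    # "Send me Records of rank in [r, r+slice-1]"
        recovered = decode(columns)                          # XOR or RS decoding process
        for spare, records in zip(spare_dbs, recovered):
            ship(spare, records)                             # "Recovered Records <TCP>"
```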
41
DBs Recovery Performance: XOR Decoding
Experimental set-up: file of 125,000 records; bucket of 31,250 records, i.e. 3.125 MB.

Slice   | Total Time (sec) | Processing Time (sec) | Communication Time (sec)
 1,250  | 0.750            | 0.291                 | 0.433
 3,125  | 0.693            | 0.249                 | 0.372
 6,250  | 0.667            | 0.260                 | 0.360
15,625  | 0.755            | 0.255                 | 0.458
31,250  | 0.734            | 0.271                 | 0.448

As the slice grows from 4% to 100% of the bucket contents, total time varies little, staying around 0.72 sec.
42
DBs Recovery Performance: RS Decoding
Experimental set-up: file of 125,000 records; bucket of 31,250 records, i.e. 3.125 MB.

Slice   | Total Time (sec) | Processing Time (sec) | Communication Time (sec)
 1,250  | 0.870            | 0.390                 | 0.443
 3,125  | 0.867            | 0.375                 | 0.375
 6,250  | 0.828            | 0.385                 | 0.303
15,625  | 0.854            | 0.375                 | 0.433
31,250  | 0.854            | 0.375                 | 0.448

As the slice grows from 4% to 100% of the bucket contents, total time varies little, staying around 0.85 sec.
43
DBs Recovery Performance: Comparison
Experimental set-up: file of 125,000 records; bucket of 31,250 records, i.e. 3.125 MB.
- 1 DB recovery time with XOR decoding: 0.720 sec
- 1 DB recovery time with RS decoding: 0.855 sec
- XOR provides a performance gain of 15% in total time.
44
DBs Recovery Performance: Recovering 2 DBs
Experimental set-up: file of 125,000 records; bucket of 31,250 records, i.e. 3.125 MB.

Slice   | Total Time (sec) | Processing Time (sec) | Communication Time (sec)
 1,250  | 1.234            | 0.590                 | 0.519
 3,125  | 1.172            | 0.599                 | 0.400
 6,250  | 1.172            | 0.598                 | 0.365
15,625  | 1.146            | 0.609                 | 0.443
31,250  | 1.088            | 0.599                 | 0.442

As the slice grows from 4% to 100% of the bucket contents, total time varies little, staying around 1.2 sec.
45
DBs Recovery Performance: Recovering 3 DBs
Experimental set-up: file of 125,000 records; bucket of 31,250 records, i.e. 3.125 MB.

Slice   | Total Time (sec) | Processing Time (sec) | Communication Time (sec)
 1,250  | 1.589            | 0.922                 | 0.522
 3,125  | 1.599            | 0.928                 | 0.383
 6,250  | 1.541            | 0.907                 | 0.401
15,625  | 1.578            | 0.891                 | 0.520
31,250  | 1.468            | 0.906                 | 0.495

As the slice grows from 4% to 100% of the bucket contents, total time varies little, staying around 1.6 sec.
46
Performance Summary of Bucket Recovery
- 1 DB (3.125 MB) in 0.7 sec (XOR): 4.46 MB/sec
- 1 DB (3.125 MB) in 0.85 sec (RS): 3.65 MB/sec
- 2 DBs (6.250 MB) in 1.2 sec (RS): 5.21 MB/sec
- 3 DBs (9.375 MB) in 1.6 sec (RS): 5.86 MB/sec
47
Conclusion
The conducted experiments show that:
- Encoding/decoding optimization and the enhanced bucket architecture both have a clear impact on performance.
- Recovery performance is good.
- Finally, we improved the processing time of the RS decoding process by 4% to 8%.
- 1 DB is recovered in half a second.
48
Conclusion
LH*RS:
- Mature implementation
- Many optimization iterations
- The only SDDS with scalable availability
49
Future Work
- A better parity-update propagation strategy to the PBs
- Investigation of faster encoding/decoding processes
50
References
[Patterson et al., 88] D. A. Patterson, G. Gibson & R. H. Katz, A Case for Redundant Arrays of Inexpensive Disks, Proc. of the ACM SIGMOD Conf., pp. 109-116, June 1988.
[ISI, 81] Information Sciences Institute, RFC 793: Transmission Control Protocol (TCP) - Specification, Sept. 1981, http://www.faqs.org/rfcs/rfc793.html
[MacDonald & Barkley, 00] D. MacDonald, W. Barkley, MS Windows 2000 TCP/IP Implementation Details, http://secinf.net/info/nt/2000ip/tcpipimp.html
[Jacobson, 88] V. Jacobson, M. J. Karels, Congestion Avoidance and Control, Computer Communication Review, Vol. 18, No. 4, pp. 314-329.
[Xu et al., 99] L. Xu & J. Bruck, X-Code: MDS Array Codes with Optimal Encoding, IEEE Trans. on Information Theory, 45(1), pp. 272-276, 1999.
[Corbett et al., 04] P. Corbett, B. English, A. Goel, T. Grcanac, S. Kleiman, J. Leong, S. Sankar, Row-Diagonal Parity for Double Disk Failure Correction, Proc. of the 3rd USENIX Conf. on File and Storage Technologies, April 2004.
[Rabin, 89] M. O. Rabin, Efficient Dispersal of Information for Security, Load Balancing and Fault Tolerance, Journal of the ACM, Vol. 36, No. 2, April 1989, pp. 335-348.
[White, 91] P. E. White, RAID X tackles design problems with existing design RAID schemes, ECC Technologies, ftp://members.aol.com.mnecctek.ctr1991.pdf
[Blomer et al., 95] J. Blomer, M. Kalfane, R. Karp, M. Karpinski, M. Luby & D. Zuckerman, An XOR-Based Erasure-Resilient Coding Scheme, ICSI Tech. Rep. TR-95-048, 1995.
51
References (Ctnd.)
[Litwin & Schwarz, 00] W. Litwin & T. Schwarz, LH*RS: A High-Availability Scalable Distributed Data Structure using Reed Solomon Codes, pp. 237-248, Proceedings of ACM SIGMOD 2000.
[Karlsson et al., 96] J. Karlsson, W. Litwin & T. Risch, LH*LH: A Scalable High Performance Data Structure for Switched Multicomputers, EDBT 96, Springer Verlag.
[Reed & Solomon, 60] I. Reed & G. Solomon, Polynomial Codes over Certain Finite Fields, Journal of the Society for Industrial and Applied Mathematics, 1960.
[Plank, 97] J. S. Plank, A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems, Software - Practice & Experience, 27(9), Sept. 1997, pp. 995-1012.
[Diène, 01] A. W. Diène, Contribution à la Gestion de Structures de Données Distribuées et Scalables, PhD Thesis, Nov. 2001, Université Paris Dauphine.
[Bennour, 00] F. Sahli Bennour, Contribution à la Gestion de Structures de Données Distribuées et Scalables, PhD Thesis, June 2000, Université Paris Dauphine.
[Moussa] http://ceria.dauphine.fr/rim/rim.html
More references: http://ceria.dauphine.fr/rim/biblio.pdf
End
53
Parity Calculus: Galois Field
GF[2^8]: 1 symbol is 1 byte || GF[2^16]: 1 symbol is 2 bytes
(+) GF[2^16] vs. GF[2^8] halves the number of symbols, and consequently the number of operations in the field
(-) Larger multiplication table sizes

New Generator Matrix
- 1st column of '1's: the 1st parity bucket executes XOR calculus instead of RS calculus, a performance gain in encoding of 20%
- 1st line of '1's: each PB executes XOR calculus for any update coming from the 1st DB of any group, a performance gain of 4% (measured for PB creation)

Encoding & Decoding Hints
- Encoding: log pre-calculus of the P matrix coefficients, an improvement of 3.5%
- Decoding: log pre-calculus of the H^-1 matrix coefficients and the b vector for multiple-bucket recovery, an improvement from 4% to 8%
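To illustrate the "log pre-calculus" idea, here is a sketch of Galois-field multiplication via precomputed log/antilog tables. It uses GF[2^8] with the common primitive polynomial 0x11D purely as an assumption for the example, not necessarily the field setup of LH*RS; GF[2^16] works the same way with 65,535-entry tables, and a generator-matrix coefficient equal to 1 degenerates the multiplication into plain XOR, which is what the first row and column of '1's exploit.

```python
# GF(2^8) multiplication with log/antilog tables (illustrative sketch).

PRIM_POLY = 0x11D                 # x^8 + x^4 + x^3 + x^2 + 1 (assumed for this example)
GF_LOG = [0] * 256
GF_ANTILOG = [0] * 255

x = 1
for i in range(255):              # tabulate the powers of the generator alpha = 2
    GF_ANTILOG[i] = x
    GF_LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= PRIM_POLY

def gf_mul(a, b):
    """Multiply two field symbols using the precomputed tables."""
    if a == 0 or b == 0:
        return 0
    return GF_ANTILOG[(GF_LOG[a] + GF_LOG[b]) % 255]

# A coefficient of 1 leaves the symbol unchanged, so a row or column of '1's in
# the generator matrix reduces the corresponding parity computation to XOR.
assert gf_mul(1, 0x57) == 0x57
assert gf_mul(0x02, 0x80) == 0x1D     # 2 * x^7 wraps around and reduces modulo 0x11D
```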