Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

Rim Moussa, [email protected], http://ceria.dauphine.fr/rim/rim.html
Thomas J.E. Schwarz, [email protected], http://www.cse.scu.edu/~tschwarz/homepage/thomas_schwarz.html

Workshop in Distributed Data & Structures, July 2004


Page 1: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

Rim Moussa, [email protected], http://ceria.dauphine.fr/rim/rim.html

Thomas J.E. Schwarz, [email protected], http://www.cse.scu.edu/~tschwarz/homepage/thomas_schwarz.html

Workshop in Distributed Data & Structures, July 2004

Page 2: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

Objective

Factors of interest:

Parity Overhead

Recovery Performance

LH*RS

Design

Implementation

Performance Measurements

Page 3: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

Overview

1. Motivation

2. Highly-available schemes

3. LH*RS

4. Architectural Design

5. Hardware testbed

6. File Creation

7. High Availability

8. Recovery

9. Conclusion

10. Future Work

Scenario Description

Performance Results

Page 4: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

Motivation

Information volume grows by about 30% per year

Disk access and CPUs are bottlenecks

Failures are frequent & costly

Business Operation                  | Industry       | Average Hourly Financial Impact
Brokerage (retail) operations       | Financial      | $6.45 million
Credit card sales authorization     | Financial      | $2.6 million
Airline reservation centers         | Transportation | $89,500
Cellular (new) service activation   | Communication  | $41,000

Source: Contingency Planning Research, 1996

Page 5: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

Requirements

Need: Highly Available Networked Data Storage Systems

Scalability

High Throughput

High Availability

Page 6: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

Scalable & Distributed Data Structure (SDDS)

Dynamic file growth

[Diagram: Clients send inserts over the network to Data Buckets (DBs). An overloaded DB signals the Coordinator ("I'm overloaded!"), which orders it to split ("You split!"); records are transferred to a new bucket.]

Page 7: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

SDDS (Ctnd.)

No centralized directory access

[Diagram: A client addresses a Data Bucket directly over the network. If the client's file image is outdated, the query is forwarded to the correct bucket, which sends the client an Image Adjustment Message.]
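The addressing rule behind this forwarding is LH*'s linear hashing. The following sketch (illustrative Python, simplified image handling and function names of my own, not the authors' code) shows a client computing an address from a possibly outdated image and a bucket deciding whether to forward:

```python
# Illustrative sketch (not the authors' code) of LH* addressing with a
# possibly outdated client image, and of server-side re-addressing.

def lh_address(key, level, split_ptr):
    """Linear-hashing address: h_level, corrected by h_(level+1) below the split pointer."""
    a = key % (2 ** level)
    if a < split_ptr:
        a = key % (2 ** (level + 1))
    return a

def client_address(key, image):
    """image = (i', n'): the client's view of the file level and split pointer."""
    i_prime, n_prime = image
    return lh_address(key, i_prime, n_prime)

def server_check(key, bucket_no, bucket_level):
    """A bucket verifies that the key really belongs to it; otherwise it returns
    the bucket to forward to (simplified; the correct bucket then sends the
    client an Image Adjustment Message so the next query goes there directly)."""
    a = key % (2 ** bucket_level)
    return None if a == bucket_no else a

# Example: the client's image lags behind (level 1, split pointer 0),
# while bucket 1 has already split and is at level 2.
print(client_address(3, image=(1, 0)))                # client sends the query to bucket 1
print(server_check(3, bucket_no=1, bucket_level=2))   # bucket 1 forwards towards bucket 3
```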

Page 8: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

Solutions towards High Availability

Data Replication

(+) Good response time, since mirrors can be queried

(-) High storage cost (n times the data for n replicas)

Parity Calculus

Erasure-resilient codes are evaluated with respect to:

Coding Rate (parity volume / data volume)

Update Penalty

Group Size used for Data Reconstruction

Complexity of Coding & Decoding
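For example, with a group of 4 Data Buckets (the configuration used in the testbed later in the talk), k = 1 parity bucket gives a coding rate of 1/4, i.e. 25% extra storage, while tolerating one failure, and k = 2 gives 1/2 (50%) while tolerating two; keeping n = 2 full replicas instead costs 100% extra storage.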

Page 9: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

Fault-Tolerant Schemes

1 server failure:

Simple XOR parity calculus: RAID systems [Patterson et al., 88], the SDDS LH*g [Litwin et al., 96]

More than 1 server failure:

Binary linear codes [Hellerstein et al., 94]

Array codes: EVENODD [Blaum et al., 94], X-code [Xu et al., 99], RDP scheme [Corbett et al., 04] (tolerate just 2 failures)

Reed-Solomon codes: IDA [Rabin, 89], RAID X [White, 91], FEC [Blomer et al., 95], Tutorial [Plank, 97], LH*RS [Litwin & Schwarz, 00] (tolerate a large number of failures)

Page 10: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

A Highly Available & Distributed Data Structure: LH*RS

[Litwin & Schwarz, 00] [Litwin, Moussa & Schwarz, sub.]

Page 11: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

LH*RS

SDDS: data distribution scheme based on Linear Hashing, LH*LH [Karlsson et al., 96], applied to the key field

Parity Calculus: Reed-Solomon codes [Reed & Solomon, 60]

Scalability

High Throughput

High Availability

Page 12: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

LH*RS File Structure

Data Buckets: records of the form [Key | Data Field]; each record inserted into a bucket receives a rank r (0, 1, 2, ...)

Parity Buckets: records of the form [Rank | Key List | Parity Field]

[Diagram: the records of rank r, one per Data Bucket of the group, form a record group; the parity record of rank r lists their keys and stores the parity of their data fields.]
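As a reading aid, here is a minimal sketch (my own illustration, with assumed record layouts) of rank-based parity records: the parity record of rank r keeps the key list of the group's records of rank r plus a parity field; with the generator column of 1s described at the end of the deck, the first parity bucket's field is a plain XOR.

```python
# Illustrative sketch only: rank-based parity records in an LH*RS-style group.
# The first parity bucket can use plain XOR (generator column of 1s);
# further parity buckets would use Reed-Solomon symbols instead.

def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

class ParityBucket:
    def __init__(self, record_size):
        self.record_size = record_size
        self.records = {}   # rank -> {"keys": [...], "parity": bytes}

    def apply_insert(self, rank, key, data):
        data = data.ljust(self.record_size, b"\0")
        rec = self.records.setdefault(
            rank, {"keys": [], "parity": b"\0" * self.record_size})
        rec["keys"].append(key)
        rec["parity"] = xor_bytes(rec["parity"], data)

    def apply_delete(self, rank, key, data):
        # XOR is its own inverse, so removing a record re-XORs its data field.
        data = data.ljust(self.record_size, b"\0")
        rec = self.records[rank]
        rec["keys"].remove(key)
        rec["parity"] = xor_bytes(rec["parity"], data)

# Example: two data records of rank 0, coming from two DBs of the same group.
pb = ParityBucket(record_size=8)
pb.apply_insert(rank=0, key=13, data=b"alpha")
pb.apply_insert(rank=0, key=42, data=b"beta")
print(pb.records[0]["keys"])    # [13, 42]
```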

Page 13: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

Architectural Design of LH*RS

Page 14: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

Communication

Use of TCP/IP (better performance & reliability than UDP for large transfers):

New PB Creation

Large Update Transfer (DB split)

Bucket Recovery

Use of UDP (speed):

Individual Insert / Update / Delete / Search Queries

Record Recovery

Service and Control Messages
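The rule above amounts to a small dispatch table; the message-type names below are assumptions of this sketch, not identifiers from the implementation.

```python
# Illustrative mapping of LH*RS message types to transports, as described above.

TCP_MESSAGES = {
    "new_parity_bucket_creation",   # ship a whole bucket's contents
    "split_update_transfer",        # large parity-update buffer after a DB split
    "bucket_recovery_buffer",       # buffers exchanged during bucket recovery
}

UDP_MESSAGES = {
    "insert", "update", "delete", "search",   # individual key queries
    "record_recovery",
    "service_or_control",
}

def transport_for(message_type):
    if message_type in TCP_MESSAGES:
        return "TCP"    # reliability matters more than latency for bulk data
    if message_type in UDP_MESSAGES:
        return "UDP"    # short messages, lowest latency; acks handled above UDP
    raise ValueError(f"unknown message type: {message_type}")

print(transport_for("insert"))                  # UDP
print(transport_for("bucket_recovery_buffer"))  # TCP
```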

Page 15: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

Bucket Architecture

[Diagram of a bucket's internal architecture: a Multicast Listening Thread (multicast listening port) feeding a Multicast Working Thread; a UDP Listening Thread (receive UDP port) and a TCP Listening Thread (TCP/IP port) filling a message queue that is processed by Working Threads 1..n; a send UDP port managed by an Acknowledgement Management Thread that tracks the sending credit, the window of free zones, and the messages kept while waiting for (not yet received) acknowledgements; a process buffer for TCP connections.]
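A compressed sketch of this thread layout (Python threading as a stand-in; the actual bucket is a Windows 2000 server process built on SDDS2000, so only the structure is mirrored here, not the implementation):

```python
# Minimal sketch of the listener-plus-worker structure shown in the figure:
# listening threads enqueue incoming messages, worker threads process them.
import queue
import threading

message_queue = queue.Queue()

def udp_listening_thread():
    # In the real bucket this blocks on the receive UDP port.
    for i in range(3):                       # stand-in for a recvfrom() loop
        message_queue.put(("udp", f"query-{i}"))

def tcp_listening_thread():
    # In the real bucket this accepts passive-open TCP/IP connections and
    # reads large buffers (splits, recovery) into a process buffer.
    message_queue.put(("tcp", "bulk-buffer"))

def working_thread(worker_id):
    while True:
        transport, msg = message_queue.get()
        if msg == "stop":
            break
        print(f"worker {worker_id} handled {msg} ({transport})")
        message_queue.task_done()

workers = [threading.Thread(target=working_thread, args=(i,)) for i in range(2)]
listeners = [threading.Thread(target=udp_listening_thread),
             threading.Thread(target=tcp_listening_thread)]
for t in workers + listeners:
    t.start()
for t in listeners:
    t.join()
message_queue.join()                # wait until all real messages are processed
for _ in workers:
    message_queue.put(("ctl", "stop"))
for t in workers:
    t.join()
```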

Page 16: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

Architectural Design

Enhancements to the SDDS2000 [B00, D01] bucket architecture:

TCP/IP Connection Handler: TCP/IP connections are passive OPEN (RFC 793 [ISI, 81]; TCP/IP implementation under the Windows 2000 Server OS [McDonal & Barkley, 00])

Flow Control and Acknowledgement Management: principle of "sending credit + message conservation until delivery" [Jacobson, 88] [Diène, 01]

Example, recovery of 1 DB: SDDS2000 architecture 6.7 s, new architecture 2.6 s, an improvement of about 60% (hardware config.: 733 MHz machines, 100 Mbps network)
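A minimal model (my own sketch, not the SDDS2000/LH*RS code) of the "sending credit + message conservation until delivery" principle: at most `credit` messages may be outstanding, and every sent message is kept until its acknowledgement arrives, so it can be resent on a timeout.

```python
# Illustrative sketch of flow control with a sending credit and
# conservation of messages until delivery.
from collections import OrderedDict

class CreditSender:
    def __init__(self, credit, send_fn):
        self.credit = credit            # sending credit (max unacked messages)
        self.send_fn = send_fn          # e.g. a UDP send
        self.unacked = OrderedDict()    # msg_id -> message, kept until delivery

    def try_send(self, msg_id, message):
        if len(self.unacked) >= self.credit:
            return False                # window full: caller must wait for acks
        self.send_fn(msg_id, message)
        self.unacked[msg_id] = message  # conserve until acknowledged
        return True

    def on_ack(self, msg_id):
        self.unacked.pop(msg_id, None)  # frees one unit of credit

    def resend_unacked(self):
        for msg_id, message in self.unacked.items():
            self.send_fn(msg_id, message)   # timeout path

# Example with credit = 5, the client setting used in the measurements below.
sent = []
s = CreditSender(credit=5, send_fn=lambda i, m: sent.append(i))
for i in range(7):
    s.try_send(i, f"insert-{i}")    # messages 5 and 6 are refused until acks arrive
s.on_ack(0); s.on_ack(1)
s.try_send(5, "insert-5")
print(sent)             # [0, 1, 2, 3, 4, 5]
print(list(s.unacked))  # [2, 3, 4, 5]
```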

Page 17: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

Architectural Design (Ctnd.)

Multicast Component

Before: a pre-defined & static table of IP addresses

Now: a dynamic IP-address structure, updated when new/spare buckets (PBs/DBs) are added, through a multicast probe

[Diagram: the Coordinator, DBs and PBs; blank DBs listen on a "Blank DBs" multicast group, blank PBs on a "Blank PBs" multicast group.]

Page 18: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

Hardware Testbed

5 machines (Pentium IV 1.8 GHz, 512 MB RAM)

Ethernet network: max bandwidth of 1 Gbps

Operating system: Windows 2000 Server

Tested configuration: 1 client, a group of 4 Data Buckets, k Parity Buckets with k in {0, 1, 2}

Page 19: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

LH*RS

File Creation

Page 20: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

File Creation

Client Operation: propagation of each Insert / Update / Delete on a data record to the Parity Buckets

Data Bucket Split:

Splitting Data Bucket, to its PBs: (records that remain) N deletes from the old rank & N inserts at the new rank + (records that move) N deletes

New Data Bucket, to its PBs: N inserts (the moved records)

All updates are gathered in the same buffer and transferred (TCP/IP) simultaneously to the respective Parity Buckets of the splitting DB & the new DB.
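The split rules above can be sketched as follows (illustrative record and update shapes, my own simplification of the buffering step):

```python
# Illustrative sketch: building the parity-update buffers for a DB split.
# "Stay" records keep their bucket but change rank; "move" records go to the new DB.

def split_parity_updates(records, goes_to_new_bucket):
    """records: list of (key, old_rank, data) in the splitting Data Bucket.
    Returns (buffer for the splitting DB's PBs, buffer for the new DB's PBs)."""
    stay_buffer, new_buffer = [], []
    next_stay_rank = 0     # ranks are reassigned densely after the split
    next_new_rank = 0
    for key, old_rank, data in records:
        if goes_to_new_bucket(key):
            # record moves: delete it from the splitting DB's parity group ...
            stay_buffer.append(("delete", old_rank, key, data))
            # ... and insert it into the new DB's parity group
            new_buffer.append(("insert", next_new_rank, key, data))
            next_new_rank += 1
        else:
            # record remains: delete at the old rank, re-insert at its new rank
            stay_buffer.append(("delete", old_rank, key, data))
            stay_buffer.append(("insert", next_stay_rank, key, data))
            next_stay_rank += 1
    # Each buffer is then shipped in a single TCP transfer to the relevant PBs.
    return stay_buffer, new_buffer

# Example: odd keys move to the new bucket.
recs = [(10, 0, b"a"), (11, 1, b"b"), (12, 2, b"c")]
stay, new = split_parity_updates(recs, goes_to_new_bucket=lambda k: k % 2 == 1)
print(stay)
print(new)
```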

Page 21: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

File Creation Perf.

Experimental set-up: file of 25,000 data records; 1 data record = 104 B; client sending credit = 1 (the next slide shows the same experiment with credit = 5)

[Plot: file creation time (sec) vs. number of inserted keys (0 to 25,000) for k = 0, 1, 2; final times 7.896 s (k = 0), 9.990 s (k = 1), 10.963 s (k = 2)]

PB Overhead: from k = 0 to k = 1, performance degradation of 20%; from k = 1 to k = 2, performance degradation of 8%

Page 22: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

File Creation Perf.

Experimental set-up: file of 25,000 data records; 1 data record = 104 B; client sending credit = 5

[Plot: file creation time (sec) vs. number of inserted keys (0 to 25,000) for k = 0, 1, 2; final times 4.349 s (k = 0), 6.940 s (k = 1), 7.720 s (k = 2)]

PB Overhead: from k = 0 to k = 1, performance degradation of 37%; from k = 1 to k = 2, performance degradation of 10%

Page 23: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

LH*RS

Parity Bucket Creation

Page 24: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

PB Creation Scenario: Searching for a New PB

The Coordinator multicasts to the PBs connected to the Blank PBs Multicast Group: "Wanna join group g?" [Sender IP@ + Entity#, Your Entity#]

Page 25: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

PB Creation Scenario: Waiting for Replies

Candidate blank PBs reply "I would", start UDP listening, TCP listening and their working threads, then wait for confirmation; if the time-out elapses, they cancel all.

Page 26: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

PB Creation Scenario: PB Selection

The Coordinator notifies the selected PB "You are selected" <UDP>, and it disconnects from the Blank PBs Multicast Group; the other candidates receive a cancellation.

Page 27: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

PB Creation Scenario: Auto-creation, Query Phase

The new PB asks each Data Bucket of its group: "Send me your contents!" <UDP>

Page 28: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

PB Creation Scenario: Auto-creation, Encoding Phase

Each Data Bucket of the group sends the requested buffer <TCP>; the new PB processes the buffers (encoding).
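Pages 24-28 taken together amount to the following protocol outline (a simulation sketch with assumed names and a trivial selection rule, not the implementation):

```python
# Illustrative simulation of the PB-creation scenario on pages 24-28.

def create_parity_bucket(group_no, blank_pbs):
    """blank_pbs: ids of blank PBs listening on the Blank PBs multicast group."""
    # 1. The Coordinator multicasts a probe: "Wanna join group g?"
    probe = {"type": "join_probe", "group": group_no}

    # 2. Blank PBs answer "I would", start their listening/working threads,
    #    and wait for a confirmation (or a time-out).
    volunteers = list(blank_pbs)          # in this sketch, all of them reply

    if not volunteers:
        return None
    # 3. The Coordinator selects one PB and confirms over UDP; the others
    #    time out / receive a cancellation and stay in the blank-PB group.
    selected = volunteers[0]

    # 4. The selected PB leaves the multicast group, asks the group's DBs for
    #    their contents (UDP), receives the buffers over TCP, and encodes them
    #    into parity records (query phase + encoding phase).
    blank_pbs.remove(selected)
    return selected

blank = ["pb-A", "pb-B", "pb-C"]
print(create_parity_bucket(group_no=0, blank_pbs=blank))   # pb-A
print(blank)                                               # ['pb-B', 'pb-C']
```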

Page 29: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

PB Creation Perf. (XOR Encoding)

Experimental set-up: bucket size from 5,000 to 50,000 records; bucket contents = 0.625 * bucket size; file size = 2.5 * bucket size records

Bucket Size | Total Time (sec) | Processing Time (sec) | Communication Time (sec)
5000        | 0.190            | 0.140                 | 0.029
10000       | 0.429            | 0.304                 | 0.066
25000       | 1.007            | 0.738                 | 0.144
50000       | 2.062            | 1.484                 | 0.322

Encoding rate (MB/sec): 0.659, 0.640, 0.686, 0.608

For every bucket size, processing time is about 74% of the total time.

Page 30: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

PB Creation Perf. (RS Encoding)

Experimental set-up: bucket size from 5,000 to 50,000 records; bucket contents = 0.625 * bucket size; file size = 2.5 * bucket size records

Bucket Size | Total Time (sec) | Processing Time (sec) | Communication Time (sec)
5000        | 0.193            | 0.149                 | 0.035
10000       | 0.446            | 0.328                 | 0.059
25000       | 1.053            | 0.766                 | 0.153
50000       | 2.103            | 1.531                 | 0.322

Encoding rate (MB/sec): 0.673, 0.674, 0.713, 0.618

For every bucket size, processing time is about 74% of the total time.

Page 31: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

PB Creation Perf. (Comparison)

XOR encoding rate: 0.66 MB/sec

RS encoding rate: 0.673 MB/sec

For bucket size = 50,000, XOR provides a performance gain of 5% in processing time (about 2% in the total time).

Page 32: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

LH*RS

Bucket Recovery

Page 33: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

Buckets' Recovery: Failure Detection

The Coordinator polls the Data Buckets and Parity Buckets: "Are you alive?" <UDP>

Page 34: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

Buckets' Recovery: Failure Detection (Ctnd.)

The surviving Data Buckets and Parity Buckets answer "I am alive" <UDP>; the Coordinator waits for the replies.

Page 35: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

Buckets' Recovery: Searching for Spare DBs

The Coordinator (here looking for 2 spare DBs) multicasts to the DBs connected to the Blank DBs Multicast Group: "Wanna be a spare DB?" [Sender IP@, Your Entity#]

Page 36: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

Buckets' Recovery: Waiting for Replies

Candidate blank DBs reply "I would", start UDP listening, TCP listening and their working threads, then wait for confirmation; if the time-out elapses, they cancel all.

Page 37: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

Buckets' Recovery: Spare DBs Selection

The Coordinator notifies the selected DBs "You are selected" <UDP>, and they disconnect from the Blank DBs Multicast Group; the other candidates receive a cancellation.

Page 38: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

Buckets' Recovery: Recovery Manager Determination

The Coordinator sends "Recover buckets [spare IP@s + Entity#s; ...]" to a Parity Bucket, which becomes the recovery manager.

Page 39: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

Buckets' Recovery: Query Phase

The Recovery Manager asks the alive buckets (Data Buckets and Parity Buckets) participating in the recovery: "Send me records of rank in [r, r+slice-1]" <UDP>

Page 40: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

Buckets' Recovery: Reconstruction Phase

The alive buckets send the requested buffers <TCP> to the Recovery Manager, which runs the decoding process and sends the recovered records <TCP> to the spare DBs.
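For the single-failure case measured next, the decoding step reduces to an XOR over the group (multi-failure recovery uses Reed-Solomon decoding instead). A minimal sketch with assumed record layouts, my own illustration rather than the implementation:

```python
# Illustrative sketch: recovering one failed Data Bucket of a group by XOR,
# slice by slice, as in the recovery manager's query/reconstruction phases.

def xor_bytes(blocks):
    out = bytearray(len(blocks[0]))
    for b in blocks:
        for i, byte in enumerate(b):
            out[i] ^= byte
    return bytes(out)

def recover_slice(alive_data_slices, parity_slice):
    """alive_data_slices: one list of equal-size records per surviving DB of
    the group; parity_slice: the parity records of the same ranks.
    Returns the missing DB's records for those ranks."""
    recovered = []
    for rank, parity_rec in enumerate(parity_slice):
        contributions = [s[rank] for s in alive_data_slices] + [parity_rec]
        recovered.append(xor_bytes(contributions))
    return recovered

# Example: a group of 3 DBs and 1 parity bucket, records of 4 bytes.
db0 = [b"aaaa", b"bbbb"]
db1 = [b"cccc", b"dddd"]          # this bucket "fails"
db2 = [b"eeee", b"ffff"]
parity = [xor_bytes([db0[r], db1[r], db2[r]]) for r in range(2)]
print(recover_slice([db0, db2], parity))   # [b'cccc', b'dddd']
```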

Page 41: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

DBs Recovery Perf. (XOR Decoding)

Experimental set-up: file of 125,000 records; bucket of 31,250 records (3.125 MB)

Slice  | Total Time (sec) | Processing Time (sec) | Communication Time (sec)
1250   | 0.750            | 0.291                 | 0.433
3125   | 0.693            | 0.249                 | 0.372
6250   | 0.667            | 0.260                 | 0.360
15625  | 0.755            | 0.255                 | 0.458
31250  | 0.734            | 0.271                 | 0.448

Varying the slice from 4% to 100% of the bucket contents, the total time does not vary much: about 0.72 sec.

Page 42: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

DBs Recovery Perf. (RS Decoding)

Experimental set-up: file of 125,000 records; bucket of 31,250 records (3.125 MB)

Slice  | Total Time (sec) | Processing Time (sec) | Communication Time (sec)
1250   | 0.870            | 0.390                 | 0.443
3125   | 0.867            | 0.375                 | 0.375
6250   | 0.828            | 0.385                 | 0.303
15625  | 0.854            | 0.375                 | 0.433
31250  | 0.854            | 0.375                 | 0.448

Varying the slice from 4% to 100% of the bucket contents, the total time does not vary much: about 0.85 sec.

Page 43: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

DBs Recovery Perf. (Comparison)

Recovery time of 1 DB with XOR decoding: 0.720 sec

Recovery time of 1 DB with RS decoding: 0.855 sec

XOR provides a performance gain of 15% in total time.

Page 44: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

DBs Recovery Perf. (Recovery of 2 DBs, RS decoding)

Experimental set-up: file of 125,000 records; bucket of 31,250 records (3.125 MB)

Slice  | Total Time (sec) | Processing Time (sec) | Communication Time (sec)
1250   | 1.234            | 0.590                 | 0.519
3125   | 1.172            | 0.599                 | 0.400
6250   | 1.172            | 0.598                 | 0.365
15625  | 1.146            | 0.609                 | 0.443
31250  | 1.088            | 0.599                 | 0.442

Varying the slice from 4% to 100% of the bucket contents, the total time does not vary much: about 1.2 sec.

Page 45: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

DBs Recovery Perf. (Recovery of 3 DBs, RS decoding)

Experimental set-up: file of 125,000 records; bucket of 31,250 records (3.125 MB)

Slice  | Total Time (sec) | Processing Time (sec) | Communication Time (sec)
1250   | 1.589            | 0.922                 | 0.522
3125   | 1.599            | 0.928                 | 0.383
6250   | 1.541            | 0.907                 | 0.401
15625  | 1.578            | 0.891                 | 0.520
31250  | 1.468            | 0.906                 | 0.495

Varying the slice from 4% to 100% of the bucket contents, the total time does not vary much: about 1.6 sec.

Page 46: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

Perf. Summary of Bucket Recovery

1 DB (3.125 MB) in 0.7 sec (XOR): 4.46 MB/sec

1 DB (3.125 MB) in 0.85 sec (RS): 3.65 MB/sec

2 DBs (6.250 MB) in 1.2 sec (RS): 5.21 MB/sec

3 DBs (9.375 MB) in 1.6 sec (RS): 5.86 MB/sec
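These throughput figures follow directly from recovered volume divided by recovery time, e.g. 3.125 MB / 0.7 s is about 4.46 MB/sec and 9.375 MB / 1.6 s is about 5.86 MB/sec.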

Page 47: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

Conclusion

The conducted experiments show the impact on performance of:

Encoding/Decoding Optimization

Enhanced Bucket Architecture

Good recovery performance: 1 DB is recovered in about half a second

Finally, we improved the processing time of the RS decoding process by 4% to 8%

Page 48: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

Conclusion

LH*RS

Mature Implementation

Many Optimization Iterations

Only SDDS with Scalable Availability

Page 49: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

Future Work

Better parity update propagation strategy to PBs

Investigation of faster encoding/decoding processes

Page 50: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

References

[Patterson et al., 88] D. A. Patterson, G. Gibson & R. H. Katz, A Case for Redundant Arrays of Inexpensive Disks, Proc. of the ACM SIGMOD Conf., pp. 109-116, June 1988.

[ISI, 81] Information Sciences Institute, RFC 793: Transmission Control Protocol (TCP) Specification, Sept. 1981, http://www.faqs.org/rfcs/rfc793.html

[McDonal & Barkley, 00] D. MacDonal, W. Barkley, MS Windows 2000 TCP/IP Implementation Details, http://secinf.net/info/nt/2000ip/tcpipimp.html

[Jacobson, 88] V. Jacobson, M. J. Karels, Congestion Avoidance and Control, Computer Communication Review, Vol. 18, No. 4, pp. 314-329.

[Xu et al., 99] L. Xu & J. Bruck, X-Code: MDS Array Codes with Optimal Encoding, IEEE Trans. on Information Theory, 45(1), pp. 272-276, 1999.

[Corbett et al., 04] P. Corbett, B. English, A. Goel, T. Grcanac, S. Kleiman, J. Leong, S. Sankar, Row-Diagonal Parity for Double Disk Failure Correction, Proc. of the 3rd USENIX Conf. on File and Storage Technologies, April 2004.

[Rabin, 89] M. O. Rabin, Efficient Dispersal of Information for Security, Load Balancing and Fault Tolerance, Journal of the ACM, Vol. 36, No. 2, April 1989, pp. 335-348.

[White, 91] P. E. White, RAID X tackles design problems with existing design RAID schemes, ECC Technologies, ftp://members.aol.com.mnecctek.ctr1991.pdf

[Blomer et al., 95] J. Blomer, M. Kalfane, R. Karp, M. Karpinski, M. Luby & D. Zuckerman, An XOR-Based Erasure-Resilient Coding Scheme, ICSI Tech. Rep. TR-95-048, 1995.

Page 51: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

References (Ctnd.)

[Litwin & Schwarz, 00] W. Litwin & T. Schwarz, LH*RS: A High-Availability Scalable Distributed Data Structure using Reed Solomon Codes, pp. 237-248, Proceedings of the ACM SIGMOD 2000.

[Karlsson et al., 96] J. S. Karlsson, W. Litwin & T. Risch, LH*LH: A Scalable High Performance Data Structure for Switched Multicomputers, EDBT 96, Springer Verlag.

[Reed & Solomon, 60] I. Reed & G. Solomon, Polynomial Codes over Certain Finite Fields, Journal of the Society for Industrial and Applied Mathematics, 1960.

[Plank, 97] J. S. Plank, A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems, Software: Practice & Experience, 27(9), Sept. 1997, pp. 995-1012.

[Diène, 01] A. W. Diène, Contribution à la Gestion de Structures de Données Distribuées et Scalables, PhD Thesis, Nov. 2001, Université Paris Dauphine.

[Bennour, 00] F. Sahli Bennour, Contribution à la Gestion de Structures de Données Distribuées et Scalables, PhD Thesis, June 2000, Université Paris Dauphine.

[Moussa] http://ceria.dauphine.fr/rim/rim.html

More references: http://ceria.dauphine.fr/rim/biblio.pdf

Page 52: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

End

Page 53: Design & Implementation of LH*RS: a Highly-Available Distributed Data Structure

Parity Calculus

Galois Field

GF[2^8]: 1 symbol is 1 byte || GF[2^16]: 1 symbol is 2 bytes

(+) GF[2^16] vs. GF[2^8]: halves the number of symbols, and consequently the number of operations in the field

(-) Larger multiplication table sizes

New Generator Matrix

1st column of '1's: the 1st parity bucket executes XOR calculus instead of RS calculus (performance gain in encoding of 20%)

1st line of '1's: each PB executes XOR calculus for any update coming from the 1st DB of any group (performance gain of 4%, measured for PB creation)

Encoding & Decoding Hints

Encoding: log pre-calculus of the P matrix coefficients (improvement of 3.5%)

Decoding: log pre-calculus of the H^-1 matrix coefficients and of the b vector for multiple-bucket recovery (improvement from 4% to 8%)
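To make the log pre-calculus concrete: multiplication by table lookup, and an encoding step whose first parity symbol is a plain XOR. The sketch below uses GF(2^8) with the primitive polynomial 0x11D purely for brevity; the actual LH*RS choices (GF(2^16), the real generator matrix) are as described above, and all coefficients here are illustrative.

```python
# Illustrative sketch: Galois-field multiplication via log/antilog tables, and
# an encoding step where the first parity symbol reduces to XOR (column of 1s)
# while further parity symbols use field multiplications.

PRIM_POLY = 0x11D        # x^8 + x^4 + x^3 + x^2 + 1, primitive over GF(2^8)
EXP = [0] * 512          # antilog table, doubled to avoid a modulo on lookup
LOG = [0] * 256

x = 1
for i in range(255):
    EXP[i] = x
    LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= PRIM_POLY
for i in range(255, 512):
    EXP[i] = EXP[i - 255]

def gf_mul(a, b):
    """Multiplication by table lookup: exp[log a + log b]."""
    if a == 0 or b == 0:
        return 0
    return EXP[LOG[a] + LOG[b]]

def encode_parity(data_symbols, pb_columns):
    """data_symbols: one symbol per data bucket of the group.
    pb_columns[j]: the coefficients for parity bucket j (a generator column).
    The first column is all 1s, so parity bucket 0 reduces to a plain XOR."""
    parities = []
    for column in pb_columns:
        p = 0
        for c, d in zip(column, data_symbols):
            p ^= d if c == 1 else gf_mul(c, d)
        parities.append(p)
    return parities

# Group of 4 data symbols, k = 2 parity symbols; coefficients are illustrative.
data = [0x12, 0x34, 0x56, 0x78]
cols = [[1, 1, 1, 1],            # first parity bucket: XOR only
        [1, 2, 4, 8]]            # second parity bucket: RS-style combination
print([hex(p) for p in encode_parity(data, cols)])
```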