Design & Implementation of LH*RS: A Highly-Available Distributed Data Structure

Rim Moussa
[email protected]
http://ceria.dauphine.fr/rim/rim.html

Thomas J.E. Schwarz
[email protected]
http://www.cse.scu.edu/~tschwarz/homepage/thomas_schwarz.html

Workshop in Distributed Data & Structures, July 2004
2
Objective
Factors of interest are:
- Parity Overhead
- Recovery Performance

LH*RS:
- Design
- Implementation
- Performance Measurements
3
Overview
1. Motivation
2. Highly-available schemes
3. LH*RS
4. Architectural Design
5. Hardware testbed
6. File Creation
7. High Availability
8. Recovery
9. Conclusion
10. Future Work
   - Scenario Description
   - Performance Results
4
Motivation
- Information volume grows by about 30% per year
- Disk access and CPUs are bottlenecks
- Failures are frequent & costly

Business Operation                | Industry       | Average Hourly Financial Impact
Brokerage (retail) operations     | Financial      | $6.45 million
Credit card sales authorization   | Financial      | $2.6 million
Airline reservation centers       | Transportation | $89,500
Cellular (new) service activation | Communication  | $41,000

Source: Contingency Planning Research, 1996
5
Requirements
Need: Highly Available Networked Data Storage Systems
- Scalability
- High Throughput
- High Availability
6
Scalable & Distributed Data Structure (SDDS)
Dynamic file growth
[Figure: clients send inserts over the network to the Data Buckets (DBs). An overloaded DB tells the Coordinator "I'm Overloaded!"; the Coordinator replies "You Split!", and records are transferred to a new bucket.]
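The split rule behind "You Split!" is the standard LH* one; the following is a minimal sketch under that assumption (illustrative names, not the project's code): when the coordinator tells bucket n at level i to split, the bucket rehashes its records with h_(i+1) and ships roughly half of them to the new bucket n + 2^i * N.

```python
# Minimal sketch of an LH* bucket split (N is the assumed number of initial buckets).

N = 1

def h(level, key):
    """Linear-hashing function h_level(key) = key mod (2**level * N)."""
    return key % (2 ** level * N)

def split(bucket_no, level, records):
    """Rehash the bucket's records with h_(level+1); return the kept records,
    the moved records, and the address of the new bucket."""
    new_bucket_no = bucket_no + 2 ** level * N
    kept, moved = [], []
    for key, data in records:
        (kept if h(level + 1, key) == bucket_no else moved).append((key, data))
    return kept, moved, new_bucket_no

# Example: bucket 0 at level 0 splits; keys hashing to 1 move to the new bucket 1.
kept, moved, new_no = split(0, 0, [(7, 'a'), (8, 'b'), (13, 'c')])
print(new_no, kept, moved)   # 1 [(8, 'b')] [(7, 'a'), (13, 'c')]
```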
7
SDDS (Ctnd.)
No centralized directory access
[Figure: a client sends a query over the network to the Data Buckets (DBs). If the client's image is outdated, the query is forwarded to the correct DB, which sends the client an Image Adjustment Message.]
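The forwarding and Image Adjustment Message (IAM) mechanism follows the standard LH* client algorithm; the sketch below illustrates that textbook rule and is not copied from the LH*RS code.

```python
# Sketch of LH* client-side addressing with a possibly outdated image (i', n').

N = 1   # assumed number of initial buckets

def client_address(key, i_prime, n_prime):
    """Bucket address the client computes from its image (i', n')."""
    a = key % (2 ** i_prime * N)
    if a < n_prime:                        # this bucket has already split at level i'
        a = key % (2 ** (i_prime + 1) * N)
    return a

def adjust_image(image, iam_level, iam_bucket):
    """Update the client image from an IAM carrying the level j and address a
    of the bucket that actually served the query (classic LH* update rule)."""
    i_prime, n_prime = image
    if iam_level > i_prime:
        i_prime, n_prime = iam_level - 1, iam_bucket + 1
        if n_prime >= 2 ** i_prime * N:
            i_prime, n_prime = i_prime + 1, 0
    return (i_prime, n_prime)
```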
8
Solutions towards High Availability
Data Replication
(+) Good response time, since mirrors are queried
(-) High storage cost (n times the data for n replicas)

Parity Calculus
Erasure-resilient codes are evaluated with regard to:
- Coding rate (parity volume / data volume)
- Update penalty
- Group size used for data reconstruction
- Complexity of coding & decoding
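As a worked example of the coding rate: with groups of 4 data buckets (the group size used on the testbed later in this talk), k = 1 parity bucket per group gives a rate of 1/4 = 0.25, and k = 2 gives 2/4 = 0.5, versus a rate of 1 or more for replication.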
9
Fault-Tolerant Schemes
1 server failure:
- Simple XOR parity calculus: RAID systems [Patterson et al., 88], the SDDS LH*g [Litwin et al., 96]

More than 1 server failure:
- Binary linear codes [Hellerstein et al., 94]
- Array codes: EVENODD [Blaum et al., 94], X-code [Xu et al., 99], RDP scheme [Corbett et al., 04] (tolerate just 2 failures)
- Reed-Solomon codes: IDA [Rabin, 89], RAID X [White, 91], FEC [Blomer et al., 95], tutorial [Plank, 97], LH*RS [Litwin & Schwarz, 00] (tolerate a large number of failures)
A Highly Available & Distributed Data Structure: LH*RS
[Litwin & Schwarz, 00], [Litwin, Moussa & Schwarz, sub.]
11
LH*RS
SDDS: data distribution scheme based on Linear Hashing, LH*LH [Karlsson et al., 96], applied to the key field
Parity calculus: Reed-Solomon codes [Reed & Solomon, 60]
- Scalability
- High Throughput
- High Availability
12
LH*RS File Structure
[Figure: a group of Data Buckets and its Parity Buckets; records in a group are aligned by insert rank r (ranks 0, 1, 2 shown).]
Data bucket record: Key | Data Field (each record receives an insert rank r in its bucket)
Parity bucket record: Rank | [Key List] | Parity Field
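A minimal sketch of the two record layouts above, with assumed field names (the slide does not show the actual on-the-wire format):

```python
# Illustrative record layouts for LH*RS (field names are assumptions).

from dataclasses import dataclass, field
from typing import List

@dataclass
class DataRecord:
    key: int            # LH* key of the record
    data: bytes         # non-key data field

@dataclass
class ParityRecord:
    rank: int                                        # insert rank r within the bucket group
    keys: List[int] = field(default_factory=list)    # keys of the group's data records at rank r
    parity: bytes = b""                              # parity symbols computed over those records
```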
Architectural Design of LH*RS
14
Communication
Use of TCP/IP (better performance & reliability than UDP for bulk transfers):
- New PB creation
- Large update transfer (DB split)
- Bucket recovery

Use of UDP (speed):
- Individual insert/update/delete/search queries
- Record recovery
- Service and control messages
15
Bucket Architecture
[Figure: a bucket's internal architecture. Network-facing ports (multicast listening port, send and receive UDP ports, TCP/IP port) feed message queues and a process buffer for message processing. Threads: a Multicast Listening thread and a Multicast Working thread; an Ack. Management thread handling the sending-credit window (free zones, messages waiting for acknowledgement, not-yet-acked messages); UDP and TCP Listening threads; and Working threads 1..n.]
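A condensed sketch of the message path in the figure above: one UDP listening thread feeding a message queue that n working threads consume. It is illustrative only (port number and function names are assumptions); the real bucket also runs the TCP and multicast listeners and the acknowledgement-management thread shown in the figure.

```python
import queue
import socket
import threading

MSG_QUEUE = queue.Queue()     # "Message Queue - Message processing -"
NUM_WORKERS = 4               # "Work. Thread 1 .. n"
UDP_PORT = 5000               # assumed port number

def udp_listening_thread():
    """Receive datagrams and hand them to the working threads."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind(("", UDP_PORT))
    while True:
        data, addr = sock.recvfrom(64 * 1024)
        MSG_QUEUE.put((data, addr))

def working_thread():
    """Dequeue messages and process them (insert/search/update/delete, etc.)."""
    while True:
        data, addr = MSG_QUEUE.get()
        process_message(data, addr)
        MSG_QUEUE.task_done()

def process_message(data, addr):
    pass   # placeholder for the bucket's request-handling logic

threads = [threading.Thread(target=udp_listening_thread, daemon=True)]
threads += [threading.Thread(target=working_thread, daemon=True) for _ in range(NUM_WORKERS)]
# for t in threads: t.start()
```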
16
Architectural Design
Enhancements to the SDDS2000 [Bennour, 00], [Diène, 01] bucket architecture:

TCP/IP Connection Handler
- TCP/IP connections use passive OPEN (RFC 793 [ISI, 81]); TCP/IP implementation under the Windows 2000 Server OS [MacDonald & Barkley, 00]

Flow Control and Acknowledgement Management
- Principle of "sending credit + message conservation until delivery" [Jacobson, 88], [Diène, 01]
- Example, recovery of 1 DB: 6.7 s with the SDDS2000 architecture vs. 2.6 s with the new architecture, an improvement of about 60% (hardware: 733 MHz machines, 100 Mbps network)
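A sketch of the "sending credit + message conservation until delivery" principle, as one might implement it (names and structure are assumptions; in the real buckets this state lives in the Ack. Management thread of the architecture above):

```python
class CreditSender:
    """At most `credit` unacknowledged messages may be outstanding; every sent
    message is conserved until its acknowledgement arrives, so it can be resent."""

    def __init__(self, credit=5):                 # e.g. client sending credit 1 or 5
        self.credit = credit
        self.unacked = {}                         # msg_id -> message kept until delivery

    def can_send(self):
        return len(self.unacked) < self.credit

    def send(self, msg_id, message, transmit):
        if not self.can_send():
            return False                          # window full: wait for acknowledgements
        self.unacked[msg_id] = message            # conserve the message
        transmit(message)
        return True

    def on_ack(self, msg_id):
        self.unacked.pop(msg_id, None)            # frees one unit of sending credit

    def on_timeout(self, msg_id, transmit):
        if msg_id in self.unacked:
            transmit(self.unacked[msg_id])        # retransmit the conserved copy
```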
17
Architectural Design (Ctnd.)
Multicast Component
- Before: a pre-defined & static table of IP addresses
- Now: a dynamic IP address structure, updated when new/spare buckets (PBs/DBs) are added, through a multicast probe
[Figure: the Coordinator, the DBs and PBs, a Blank DBs multicast group and a Blank PBs multicast group.]
18
Hardware Testbed
5 machines (Pentium IV 1.8 GHz, 512 MB RAM)
Ethernet network: max bandwidth of 1 Gbps
Operating system: Windows 2000 Server
Tested configuration: 1 client, a group of 4 data buckets, k parity buckets with k ∈ {0, 1, 2}
LH*RS
File Creation
20
File Creation
Client Operation
- Each insert/update/delete of a data record is propagated to the parity buckets.

Data Bucket Split
- Splitting DB, for the records that remain: N deletes (from the old rank) and N inserts (at the new rank) sent to the PBs; for the records that move: N deletes.
- New DB: N inserts (the moved records) sent to its PBs.
- All updates are gathered in the same buffer and transferred (TCP/IP) simultaneously to the respective parity buckets of the splitting DB and the new DB (see the sketch below).
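A sketch of how the split's parity updates could be batched into one buffer per group, as described above (rank bookkeeping is simplified and the names are illustrative):

```python
def build_split_update_buffers(remaining, moved):
    """remaining: [(key, old_rank, new_rank)] records the splitting DB keeps
       moved:     [(key, old_rank)]           records shipped to the new DB
       Returns one update buffer per group, to be sent in a single TCP transfer
       to that group's parity buckets."""
    splitting_group_buffer, new_group_buffer = [], []

    for key, old_rank, new_rank in remaining:          # records that remain
        splitting_group_buffer.append(("delete", old_rank, key))
        splitting_group_buffer.append(("insert", new_rank, key))

    for key, old_rank in moved:                        # records that move away
        splitting_group_buffer.append(("delete", old_rank, key))

    for new_rank, (key, _) in enumerate(moved):        # same records, in the new DB's group
        new_group_buffer.append(("insert", new_rank, key))

    return splitting_group_buffer, new_group_buffer
```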
21
File Creation Performance (Client Sending Credit = 1)
Experimental set-up: file of 25,000 data records; 1 data record = 104 B.

[Chart: file creation time (sec) vs. number of inserted keys (0 to 25,000), for k = 0, 1, 2. Final times: 7.896 s (k = 0), 9.990 s (k = 1), 10.963 s (k = 2).]

PB overhead:
- k = 0 to k = 1: performance degradation of 20%
- k = 1 to k = 2: performance degradation of 8%
22
File Creation Performance (Client Sending Credit = 5)
Experimental set-up: file of 25,000 data records; 1 data record = 104 B.

[Chart: file creation time (sec) vs. number of inserted keys (0 to 25,000), for k = 0, 1, 2. Final times: 4.349 s (k = 0), 6.940 s (k = 1), 7.720 s (k = 2).]

PB overhead:
- k = 0 to k = 1: performance degradation of 37%
- k = 1 to k = 2: performance degradation of 10%
LH*RS
Parity Bucket Creation
24
PB Creation Scenario
Searching for a new PB
[Figure: the Coordinator multicasts "Wanna join group g?" [Sender IP@ + Entity#, Your Entity#] to the PBs connected to the Blank PBs multicast group.]
25
PB Creation Scenario
Waiting for replies
[Figure: each candidate PB answers "I would", starts UDP and TCP listening and its working threads, and waits for confirmation; if the time-out elapses, it cancels all. The Coordinator waits for the replies.]
26
PB Creation Scenario
PB selection
[Figure: the Coordinator sends "You are Selected" <UDP> to the chosen PB, which disconnects from the Blank PBs multicast group; the other candidates receive a Cancellation.]
27
PB Creation Scenario
Auto-creation: query phase
[Figure: the new PB sends "Send me your contents!" <UDP> to each bucket of the data bucket group.]
28
PB Creation Scenario
Auto-creation: encoding phase
[Figure: each data bucket returns the requested buffer <TCP> to the new PB, which processes the buffers (encoding).]
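Putting the preceding slides together, the parity-bucket creation protocol can be summarized by the sketch below. Message names follow the slides; the transport layer is replaced by caller-supplied callbacks, so everything else (function names, data layout) is an illustrative assumption rather than the project's code.

```python
def multicast_probe(blank_pbs, group_g):
    """'Wanna join group g?' sent to the blank-PBs multicast group; returns the
    candidates that answered 'I would'."""
    return [pb for pb in blank_pbs if pb.get("available")]

def create_parity_bucket(blank_pbs, group_g, data_buckets, udp_send, tcp_fetch, encode):
    replies = multicast_probe(blank_pbs, group_g)
    if not replies:                          # time-out elapsed with no volunteer
        return None                          # cancel all

    new_pb, *others = replies                # PB selection
    udp_send(new_pb, "You are Selected")
    for pb in others:
        udp_send(pb, "Cancellation")
    new_pb["blank"] = False                  # disconnect from the blank-PBs multicast group

    # Auto-creation: query phase (UDP requests), then encoding phase (TCP buffers).
    for db in data_buckets:
        buffer = tcp_fetch(db)               # "Send me your contents!" -> requested buffer
        encode(new_pb, buffer)               # XOR or RS encoding of the received buffer
    return new_pb
```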
29
PB Creation Performance: XOR Encoding
Experimental set-up: bucket size from 5,000 to 50,000 records; bucket contents = 0.625 * bucket size; file size = 2.5 * bucket size records.

Bucket Size | Total Time (sec) | Processing Time (sec) | Communication Time (sec)
 5,000      | 0.190            | 0.140                 | 0.029
10,000      | 0.429            | 0.304                 | 0.066
25,000      | 1.007            | 0.738                 | 0.144
50,000      | 2.062            | 1.484                 | 0.322

For all bucket sizes, processing time is about 74% of total time.
Encoding rate (MB/sec): 0.659, 0.640, 0.686, 0.608.
30
PB Creation Performance: RS Encoding
Experimental set-up: bucket size from 5,000 to 50,000 records; bucket contents = 0.625 * bucket size; file size = 2.5 * bucket size records.

Bucket Size | Total Time (sec) | Processing Time (sec) | Communication Time (sec)
 5,000      | 0.193            | 0.149                 | 0.035
10,000      | 0.446            | 0.328                 | 0.059
25,000      | 1.053            | 0.766                 | 0.153
50,000      | 2.103            | 1.531                 | 0.322

For all bucket sizes, processing time is about 74% of total time.
Encoding rate (MB/sec): 0.673, 0.674, 0.713, 0.618.
31
PB Creation Performance: Comparison
- XOR encoding rate: 0.66 MB/sec; RS encoding rate: 0.673 MB/sec
- For bucket size = 50,000, XOR provides a performance gain of 5% in processing time (0.02% in the total time).
LH*RS
Bucket Recovery
33
Buckets' Recovery: Failure Detection
[Figure: the Coordinator sends "Are You Alive?" <UDP> to the data buckets and parity buckets.]
34
Buckets' Recovery: Waiting for Replies
[Figure: the alive data buckets and parity buckets answer "I am Alive" <UDP>; the Coordinator waits for the replies.]
35
Buckets' Recovery: Searching for 2 Spare DBs
[Figure: the Coordinator multicasts "Wanna be a Spare DB?" [Sender IP@, Your Entity#] to the DBs connected to the Blank DBs multicast group.]
36
Buckets' Recovery: Waiting for Replies
[Figure: each candidate spare DB answers "I would", starts UDP and TCP listening and its working threads, and waits for confirmation; if the time-out elapses, it cancels all. The Coordinator waits for the replies.]
37
Buckets' Recovery: Spare DBs Selection
[Figure: the Coordinator sends "You are Selected" <UDP> to the chosen spare DBs, which disconnect from the Blank DBs multicast group; the other candidates receive a Cancellation.]
38
Buckets' Recovery: Recovery Manager Determination
[Figure: the Coordinator sends "Recover Buckets" [Spares IP@s + Entity#s; ...] to a parity bucket, which becomes the recovery manager.]
39
Buckets' Recovery: Query Phase
[Figure: the Recovery Manager sends "Send me Records of rank in [r, r+slice-1]" <UDP> to the alive data and parity buckets participating in the recovery; the spare DBs stand by.]
40
Buckets' Recovery: Reconstruction Phase
[Figure: the alive buckets return the requested buffers <TCP> to the Recovery Manager, which runs the decoding process and sends the recovered records <TCP> to the spare DBs.]
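The query and reconstruction phases amount to a slice-by-slice loop run by the recovery manager. The sketch below abstracts the decoding and the transport behind caller-supplied functions; it illustrates the control flow under those assumptions and is not the project's code.

```python
def recover_buckets(alive_buckets, spare_dbs, bucket_size, slice_size,
                    fetch, decode, ship):
    """fetch(bucket, r, n): records of ranks r .. r+n-1 from an alive bucket (UDP query)
       decode(columns):     rebuild the failed buckets' records for those ranks
       ship(spare, recs):   send recovered records to a spare DB (TCP)"""
    for r in range(0, bucket_size, slice_size):
        n = min(slice_size, bucket_size - r)
        columns = [fetch(b, r, n) for b in alive_buckets]    # "Send me Records of rank in [r, r+slice-1]"
        recovered = decode(columns)                          # XOR or RS decoding process
        for spare, records in zip(spare_dbs, recovered):
            ship(spare, records)                             # "Recovered Records <TCP>"
```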
41
DBs Recovery Performance: XOR Decoding
Experimental set-up: file of 125,000 records; bucket of 31,250 records, i.e. 3.125 MB.

Slice   | Total Time (sec) | Processing Time (sec) | Communication Time (sec)
 1,250  | 0.750            | 0.291                 | 0.433
 3,125  | 0.693            | 0.249                 | 0.372
 6,250  | 0.667            | 0.260                 | 0.360
15,625  | 0.755            | 0.255                 | 0.458
31,250  | 0.734            | 0.271                 | 0.448

As the slice grows from 4% to 100% of the bucket contents, total time varies little, staying around 0.72 sec.
42
DBs Recovery Performance: RS Decoding
Experimental set-up: file of 125,000 records; bucket of 31,250 records, i.e. 3.125 MB.

Slice   | Total Time (sec) | Processing Time (sec) | Communication Time (sec)
 1,250  | 0.870            | 0.390                 | 0.443
 3,125  | 0.867            | 0.375                 | 0.375
 6,250  | 0.828            | 0.385                 | 0.303
15,625  | 0.854            | 0.375                 | 0.433
31,250  | 0.854            | 0.375                 | 0.448

As the slice grows from 4% to 100% of the bucket contents, total time varies little, staying around 0.85 sec.
43
DBs Recovery Performance: Comparison
Experimental set-up: file of 125,000 records; bucket of 31,250 records, i.e. 3.125 MB.
- 1 DB recovery time with XOR decoding: 0.720 sec
- 1 DB recovery time with RS decoding: 0.855 sec
- XOR provides a performance gain of 15% in total time.
44
DBs Recovery Performance: Recovering 2 DBs
Experimental set-up: file of 125,000 records; bucket of 31,250 records, i.e. 3.125 MB.

Slice   | Total Time (sec) | Processing Time (sec) | Communication Time (sec)
 1,250  | 1.234            | 0.590                 | 0.519
 3,125  | 1.172            | 0.599                 | 0.400
 6,250  | 1.172            | 0.598                 | 0.365
15,625  | 1.146            | 0.609                 | 0.443
31,250  | 1.088            | 0.599                 | 0.442

As the slice grows from 4% to 100% of the bucket contents, total time varies little, staying around 1.2 sec.
45
DBs Recovery Performance: Recovering 3 DBs
Experimental set-up: file of 125,000 records; bucket of 31,250 records, i.e. 3.125 MB.

Slice   | Total Time (sec) | Processing Time (sec) | Communication Time (sec)
 1,250  | 1.589            | 0.922                 | 0.522
 3,125  | 1.599            | 0.928                 | 0.383
 6,250  | 1.541            | 0.907                 | 0.401
15,625  | 1.578            | 0.891                 | 0.520
31,250  | 1.468            | 0.906                 | 0.495

As the slice grows from 4% to 100% of the bucket contents, total time varies little, staying around 1.6 sec.
46
Performance Summary of Bucket Recovery
- 1 DB (3.125 MB) in 0.7 sec (XOR): 4.46 MB/sec
- 1 DB (3.125 MB) in 0.85 sec (RS): 3.65 MB/sec
- 2 DBs (6.250 MB) in 1.2 sec (RS): 5.21 MB/sec
- 3 DBs (9.375 MB) in 1.6 sec (RS): 5.86 MB/sec
47
Conclusion
The conducted experiments show that:
- Encoding/decoding optimization and the enhanced bucket architecture both have a clear impact on performance.
- Recovery performance is good.
- Finally, we improved the processing time of the RS decoding process by 4% to 8%.
- 1 DB is recovered in half a second.
48
Conclusion
LH*RS:
- Mature implementation
- Many optimization iterations
- The only SDDS with scalable availability
49
Future Work
- A better parity-update propagation strategy to the PBs
- Investigation of faster encoding/decoding processes
50
References
[Patterson et al., 88] D. A. Patterson, G. Gibson & R. H. Katz, A Case for Redundant Arrays of Inexpensive Disks, Proc. of the ACM SIGMOD Conf., pp. 109-116, June 1988.
[ISI, 81] Information Sciences Institute, RFC 793: Transmission Control Protocol (TCP) - Specification, Sept. 1981, http://www.faqs.org/rfcs/rfc793.html
[MacDonald & Barkley, 00] D. MacDonald, W. Barkley, MS Windows 2000 TCP/IP Implementation Details, http://secinf.net/info/nt/2000ip/tcpipimp.html
[Jacobson, 88] V. Jacobson, M. J. Karels, Congestion Avoidance and Control, Computer Communication Review, Vol. 18, No. 4, pp. 314-329.
[Xu et al., 99] L. Xu & J. Bruck, X-Code: MDS Array Codes with Optimal Encoding, IEEE Trans. on Information Theory, 45(1), pp. 272-276, 1999.
[Corbett et al., 04] P. Corbett, B. English, A. Goel, T. Grcanac, S. Kleiman, J. Leong, S. Sankar, Row-Diagonal Parity for Double Disk Failure Correction, Proc. of the 3rd USENIX Conf. on File and Storage Technologies, April 2004.
[Rabin, 89] M. O. Rabin, Efficient Dispersal of Information for Security, Load Balancing and Fault Tolerance, Journal of the ACM, Vol. 36, No. 2, April 1989, pp. 335-348.
[White, 91] P. E. White, RAID X tackles design problems with existing design RAID schemes, ECC Technologies, ftp://members.aol.com.mnecctek.ctr1991.pdf
[Blomer et al., 95] J. Blomer, M. Kalfane, R. Karp, M. Karpinski, M. Luby & D. Zuckerman, An XOR-Based Erasure-Resilient Coding Scheme, ICSI Tech. Rep. TR-95-048, 1995.
51
References (Ctnd.)
[Litwin & Schwarz, 00] W. Litwin & T. Schwarz, LH*RS: A High-Availability Scalable Distributed Data Structure using Reed Solomon Codes, pp. 237-248, Proceedings of ACM SIGMOD 2000.
[Karlsson et al., 96] J. Karlsson, W. Litwin & T. Risch, LH*LH: A Scalable High Performance Data Structure for Switched Multicomputers, EDBT 96, Springer Verlag.
[Reed & Solomon, 60] I. Reed & G. Solomon, Polynomial Codes over Certain Finite Fields, Journal of the Society for Industrial and Applied Mathematics, 1960.
[Plank, 97] J. S. Plank, A Tutorial on Reed-Solomon Coding for Fault-Tolerance in RAID-like Systems, Software - Practice & Experience, 27(9), Sept. 1997, pp. 995-1012.
[Diène, 01] A. W. Diène, Contribution à la Gestion de Structures de Données Distribuées et Scalables, PhD Thesis, Nov. 2001, Université Paris Dauphine.
[Bennour, 00] F. Sahli Bennour, Contribution à la Gestion de Structures de Données Distribuées et Scalables, PhD Thesis, June 2000, Université Paris Dauphine.
[Moussa] http://ceria.dauphine.fr/rim/rim.html
More references: http://ceria.dauphine.fr/rim/biblio.pdf
End
53
Parity Calculus: Galois Field
GF[2^8]: 1 symbol is 1 byte || GF[2^16]: 1 symbol is 2 bytes
(+) GF[2^16] vs. GF[2^8] halves the number of symbols, and consequently the number of operations in the field
(-) Larger multiplication table sizes

New Generator Matrix
- 1st column of '1's: the 1st parity bucket executes XOR calculus instead of RS calculus, a performance gain in encoding of 20%
- 1st line of '1's: each PB executes XOR calculus for any update coming from the 1st DB of any group, a performance gain of 4% (measured for PB creation)

Encoding & Decoding Hints
- Encoding: log pre-calculus of the P matrix coefficients, an improvement of 3.5%
- Decoding: log pre-calculus of the H^-1 matrix coefficients and the b vector for multiple-bucket recovery, an improvement from 4% to 8%
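To illustrate the "log pre-calculus" idea, here is a sketch of Galois-field multiplication via precomputed log/antilog tables. It uses GF[2^8] with the common primitive polynomial 0x11D purely as an assumption for the example, not necessarily the field setup of LH*RS; GF[2^16] works the same way with 65,535-entry tables, and a generator-matrix coefficient equal to 1 degenerates the multiplication into plain XOR, which is what the first row and column of '1's exploit.

```python
# GF(2^8) multiplication with log/antilog tables (illustrative sketch).

PRIM_POLY = 0x11D                 # x^8 + x^4 + x^3 + x^2 + 1 (assumed for this example)
GF_LOG = [0] * 256
GF_ANTILOG = [0] * 255

x = 1
for i in range(255):              # tabulate the powers of the generator alpha = 2
    GF_ANTILOG[i] = x
    GF_LOG[x] = i
    x <<= 1
    if x & 0x100:
        x ^= PRIM_POLY

def gf_mul(a, b):
    """Multiply two field symbols using the precomputed tables."""
    if a == 0 or b == 0:
        return 0
    return GF_ANTILOG[(GF_LOG[a] + GF_LOG[b]) % 255]

# A coefficient of 1 leaves the symbol unchanged, so a row or column of '1's in
# the generator matrix reduces the corresponding parity computation to XOR.
assert gf_mul(1, 0x57) == 0x57
assert gf_mul(0x02, 0x80) == 0x1D     # 2 * x^7 wraps around and reduces modulo 0x11D
```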