efficient reliable broadcast for commodity clusters · 8 x intel mp1.4-complaint smp machines. each...

Raymond Wong and Raymond Wong and ChoCho--Li WangLi WangDepartment of Computer Science and Information SystemsDepartment of Computer Science and Information Systems

The University of Hong KongThe University of Hong Kong

Efficient Reliable Efficient Reliable Broadcast for Broadcast for

Commodity ClustersCommodity Clusters

What is a cluster?What is a cluster?

A cluster is a type of parallel or A cluster is a type of parallel or distributed processing system, distributed processing system, which consists of a collection of which consists of a collection of

interconnected interconnected standstand--alone alone computerscomputers cooperatively cooperatively

working together as a working together as a singlesingle, , integrated computing resource. integrated computing resource.

---- IEEE TFCCIEEE TFCC

33

Efficient Reliable BroadcastEfficient Reliable Broadcast

vvEfficient clustering requires efficient networking Efficient clustering requires efficient networking for tightly coupling all resources. for tightly coupling all resources. vvImproving network performance helps Improving network performance helps

improving performance in cluster computation.improving performance in cluster computation.vvEfficient Reliable BroadcastEfficient Reliable Broadcast : Let: Let s do s do all all

together together the most basic synchronization and the most basic synchronization and data movement operation.data movement operation.

44

Main objectivesMain objectives

§§ To achieve the fastest broadcast in a commodity SMP To achieve the fastest broadcast in a commodity SMP cluster connected by a network with hardware cluster connected by a network with hardware broadcast.broadcast.

§§ Reduce resource consumptions: Reduce resource consumptions: Computation: Computation: e.g., CPU cycles e.g., CPU cycles lowlow--latency latency Memory: Memory: e.g., send/receive buffers e.g., send/receive buffers NetworkNetwork: e.g., avoid redundant traffic: e.g., avoid redundant traffic

vv You can consider the collective operation as a You can consider the collective operation as a fatfatcommunication command. communication command.

55

OutlineOutline

vvBackgroundBackgroundvvPushPush--Pull MessagingPull MessagingvvHardware BroadcastHardware BroadcastvvPerformance EvaluationPerformance EvaluationvvConclusionsConclusions

66

Tackling the Problem...Tackling the Problem...

§§ Theoretical broadcast studies have focused on Theoretical broadcast studies have focused on the delivery strategy of packets based on some the delivery strategy of packets based on some abstract model abstract model

PostalPostal (1992) : A. Bar(1992) : A. Bar--NoyNoy [2][2]Lopsided TreesLopsided Trees (1997) : (1997) : GolinGolin et. al.et. al. , [7], [7]LogPLogP (1993), Karp, [9] (Also, (1993), Karp, [9] (Also, SubramonianSubramonian ss multiplemultiple--item item broadcast in broadcast in LogP LogP model)model)Star GraphStar Graph (1997): Y.C. Tseng, (1997): Y.C. Tseng, et. alet. al., [14].., [14].Hypercube, Mesh, Hypercube, Mesh, ToriTori: Survey paper [McKinley: 1995]: Survey paper [McKinley: 1995]

§§ Efficient in terms of complexity.Efficient in terms of complexity.§§ Could not be practically implemented.Could not be practically implemented.

77


§§Broadcast algorithms in Broadcast algorithms in messagemessagelevel:level:

IBM SP2IBM SP2 (MPL): (MPL): Abandah Abandah (U. of Michigan), [IPPS(U. of Michigan), [IPPS 96]96]InterComInterCom ProjectProject ((iCCiCC library, INTEL Paragon, 1995): library, INTEL Paragon, 1995): MitraMitra, , et., al. (short, long, hybrid)et., al. (short, long, hybrid)MPICH:MPICH: GroppGropp, et. Al, (linear, tree, et. Al, (linear, tree--based) [6]based) [6]

§§ To high level To high level good portabilitygood portability§§ Cannot take advantage of underlying system Cannot take advantage of underlying system

featuresfeatures

88


§§Hardware broadcast is efficient. We Hardware broadcast is efficient. We adopt it.adopt it.§§But research issues are,But research issues are,

How to utilize the hardware broadcast How to utilize the hardware broadcast operation in useroperation in user--level for efficient data level for efficient data movement?movement?Transferring broadcast packets is not Transferring broadcast packets is not reliable. How to make it reliable?reliable. How to make it reliable?Single Single fatfat packet for multiple nodes. packet for multiple nodes. What is the delivery strategy?What is the delivery strategy?

99

PushPush--Pull MessagingPull Messaging [17] : Concept[17] : Concept

PushPhase

Sender Receiver

Send BTP

Acknowledge

Send Rest

PullPhase

BTP: Bytes-to-Push(short message)

Now, prepare the buffer

(send the rest of message)

vv Allow better buffer Allow better buffer managementmanagement

vv Can avoid long latency Can avoid long latency by applying various by applying various latency hiding latency hiding techniques.techniques.

SenderSender ReceiverReceiver

1010

Main Data Structures in DPMain Data Structures in DP

§ (1) send queue stores pending send requests. send buffer stores the data. § (2) receive queue stores pending receive

requests. Packets received from the NIC are stored in receive buffer.§ (3) buffer queue and pushed buffer store

pending incoming packets where their destinations in memory are not determined. § Three queues can be accessed by both user

and kernel threads.

1111

PushPush--Pull Messaging: Pull Messaging: ArchitectureArchitecture

DPProcess

DPProc.

Send Process

Source Buffer

ReceptionHandler

User Space Kernel Space

Buffer Queue andPushed Buffer

Receive Queue

Send Queue

Incoming FIFO BufferQueue

Outgoing FIFO BufferQueue

ReceptionHandler

Kernel Space



1a

2b.1

Ack

3b1b.1

1b.2

4

Queues used by otherprocesses

DPProc.DP

Proc.

Queues used by other


1212

Architecture (Receive)Architecture (Receive)

ReceptionHandler

Kernel Space



Receive Process

Destination Buffer

ReceptionHandler

User SpaceKernel Space

Buffer Queue andPushed Buffer

Receive Queue

Send Queue



2a

2b.12b.2

Ack

3b

1b.2

4

Ack

3a

Queues used by other DPProcess

Queues used by otherprocesses

DPProc.DP

Proc.DPProc.


1313

Performance Optimization : DPPerformance Optimization : DP--SMP, 1999 ICPP [17]SMP, 1999 ICPP [17]

vvCrossCross--Space Zero BufferSpace Zero BuffervvAddress Translation Overhead MaskingAddress Translation Overhead MaskingvvPushPush--andand--Acknowledge OverlappingAcknowledge Overlapping

1414

DPDP--SMP PerformanceSMP Performance

§§ InternodeInternode: : (machine(machine--toto--machine)machine)

SingleSingle--trip latencytrip latency ((ALR 4ALR 4--way Pentium Pro. 200 MHz way Pentium Pro. 200 MHz SMP, 66 MHz system bus, backSMP, 66 MHz system bus, back--toto--backback) : ) : 30.130.1microseconds (microseconds (88--byte messagebyte message))Bandwidth:Bandwidth: 12.1 MB/s12.1 MB/s ((Digital DEC 21140A Fast Digital DEC 21140A Fast EthernetEthernet) at 40KB message) at 40KB message

§§ IntranodeIntranode: : (process(process--toto--process within the same node)process within the same node)

SingleSingle--trip latencytrip latency : : 7.57.5 microseconds (8microseconds (8--byte byte message)message)Bandwidth:Bandwidth: 350.9 MB/s350.9 MB/s at 40KB messageat 40KB message

1515

Directed PointDirected Point abstraction model [10]abstraction model [10]

NID 1

NID 2

NID 3

NID 4

DPID 1

DPID 1

DPID 1

DPID 1

DPID 2

DPID 2

DPProcess

DPProcess

DPProcess

DPProcess

DPProcess

NID: node IDNID: node ID

DPID: endpoint IDDPID: endpoint ID

1616

Broadcast:Broadcast: traditional hightraditional high--level level implementationimplementation

§§ Use a sequence of pointUse a sequence of point--toto--point communication. point communication. §§ Simple, but it cannot be fully optimized for the Simple, but it cannot be fully optimized for the

performance. performance. Reliable channels are maintained independently. Each Reliable channels are maintained independently. Each channel may keep transmission and reception buffers.channel may keep transmission and reception buffers.Poor scalability: the number of transmission and Poor scalability: the number of transmission and reception buffers in the root node increases as the size reception buffers in the root node increases as the size of the cluster increases. of the cluster increases. Extra synchronization overheads incur while switching Extra synchronization overheads incur while switching from channel to channel.from channel to channel.

1717

New Data StructuresNew Data Structures

§§ Enhanced Queue Architecture (EQA)Enhanced Queue Architecture (EQA)Allows multiple senders to share one single Allows multiple senders to share one single queue and buffer properly. queue and buffer properly. An entry in a queue and buffer could be An entry in a queue and buffer could be retrieved by many senders which linked to retrieved by many senders which linked to the queue and buffer. the queue and buffer.

§§ LightLight--weight Directed Point (LDP)weight Directed Point (LDP)LDP = DP without buffers. LDP = DP without buffers. LDP stores pointers which point to LDP stores pointers which point to appropriate BUF in a DP.appropriate BUF in a DP.

1818

Enhanced Queue ArchitectureEnhanced Queue Architecture

HMS

LMS

send buffer (BUF)

LMS

head+count

head+count

head+count

head

head

head

local_head

next

struct q_t

DP

struct q_local_t

DP

DP

struct q_local_t

DP

LDP1

LDP2NID1

NID2

NID3

NID4

Enhanced Queue ArchitectureEnhanced Queue Architecture

1919

Broadcast with EQABroadcast with EQA

NID1NID1

NID2NID2

NID3NID3

NID4NID4

HMSHMSHMS

BUFBUFBUFDP1DP1

LMSLMSLDP2LDP2

LMSLMS

LDP3LDP3

HMSHMSHMS

BUFBUFBUFDP1DP1

HMSHMSHMS

BUFBUFBUFDP1DP1

HMSHMSHMS

BUFBUFBUFDP1DP1

Root nodeRoot node

HMSHMSHMS Home management structureHome management structure

LMSLMSLMS Local management structureLocal management structure

Leaf nodeLeaf node

Leaf nodeLeaf node

Leaf nodeLeaf node

Hardware broadcastHardware broadcast

2020

Two HardwareTwo Hardware--based based Broadcast AlgorithmsBroadcast Algorithms

vv (1) Simple Broadcast.(1) Simple Broadcast.Packets are (H/W) broadcast one by one. Flow control: go-back-n protocol -- controlled by DP (HMS)Packets may be lost if the destination buffers are not allocated due to the late receive operation.Retransmission: Lost packets will be re-sent (use point-to-point operation) according to transmission records stored in LDP (LMS)

2121

Two HardwareTwo Hardware--based based Broadcast AlgorithmsBroadcast Algorithms

vv(2) Push(2) Push--Pull Broadcast.Pull Broadcast.Push phase: a portion of the broadcast message is pushed to all the leaf nodes.

¾ ONLY one DP would send acknowledge packets after finishing the push phase.

Pull phase: the source DP broadcasts the remaining packets to all DPs one by one.

¾¾ PointPoint--toto--point communication is used to repoint communication is used to re--send the lost send the lost packets during the pull phase based on a packets during the pull phase based on a go-back-n protocol. .

2222

Performance EvaluationPerformance Evaluation

§§ Cluster configuration:Cluster configuration:8 x Intel MP1.48 x Intel MP1.4--complaint SMP machines.complaint SMP machines.Each consisted of 2 IntelEach consisted of 2 Intel CeleronCeleron 450 MHz 450 MHz processors with 128 Mbytes memory.processors with 128 Mbytes memory.Connected by Fast Ethernet.Connected by Fast Ethernet.OS: Linux 2.2.1OS: Linux 2.2.1

vvBroadcast algorithms tested:Broadcast algorithms tested:vv Simple Broadcast (SBCAST) Simple Broadcast (SBCAST) vv PushPush--Pull Broadcast (PPBCAST)Pull Broadcast (PPBCAST)

2323

0

100

200

300

400

500

600

700

800

0 500 1000 1500 2000 2500 3000 3500 4000 4500Size (Bytes)

Late

ncy

(us)

PPBCAST/4

SBCAST/4

MPICH/4

Broadcast Latency Test

2424

Short Messages (256 bytes)

0

50

100

150

200

250

300

350

400

p=2 p=4 p=8Cluster Size

Late

ncy

(us)

PPBCAST

MPICH

Long Messages (4096 bytes)

0

100

200

300

400

500

600

700

800

900

p=2 p=4 p=8Cluster Size

PPBCAST

MPICH

La

ten

cy (

us)

Broadcast Latency Comparison

142142 s s

2525

Parallel Ray Tracing Parallel Ray Tracing vvUsing MPIPOV by ParMaUsing MPIPOV by ParMa2 2 with MPI/DPwith MPI/DP--SMPSMP

20000 25000 30000 35000 40000 45000 50000

MPI/DP-SMP

MPICH

Pixels/Second

skyvase.pov 1280x1024

Pixels per secondPixels per second

MPICHMPICH

MPI/DPMPI/DP--SMPSMP

2626

ConclusionsConclusions

§§ Using Using hardware broadcast feature, hardware broadcast feature, single single fatfat packet could be received by a number packet could be received by a number

of attached hosts at the switch.of attached hosts at the switch.§§ Compare to multiple Compare to multiple unicastunicast packetpacket

Larger bandwidthLarger bandwidthShorter latencyShorter latency

§§With EQA, the With EQA, the computationcomputation, , memorymemory and and networknetwork resources can be utilized more resources can be utilized more efficiently. efficiently.

2727

Future WorksFuture Works

vvDevelop more efficient reliable protocols on Develop more efficient reliable protocols on larger cluster sizes.larger cluster sizes.vvIncorporation of the hardware broadcast Incorporation of the hardware broadcast

facility with other parallel applications:facility with other parallel applications:vv Software DSM : JUMPSoftware DSM : JUMP--DPDPvv NN--Body simulationBody simulationvv ClusterCluster--based Web Caching : fast lookupbased Web Caching : fast lookupvv Search Engine: broadcast queriesSearch Engine: broadcast queriesvv Performance benchmarking softwarePerformance benchmarking software

The EndThe End

The Systems Research GroupThe Systems Research Grouphttp://http://www.www.srgsrg..csiscsis..hkuhku..hkhk//

Department of Computer Science Department of Computer Science and Information Systemsand Information SystemsThe University of Hong KongThe University of Hong Kong

efficient reliable broadcast for commodity clusters · 8 x intel mp1.4-complaint smp machines. each...

Documents