Multicomputer distributed system (2014) · 2017-01-22
TRANSCRIPT
LECTURE 8
DR. SAMMAN H. AMEEN
Multicomputer distributed system
PAGE 1
PAGE 2
• Wide area network (WAN): A WAN connects a large number of computers spread over large geographic distances. It can span sites in multiple cities, countries, and continents.
• Metropolitan area network (MAN): The MAN is an intermediate level between the LAN and WAN and typically spans a single city.
• Local area network (LAN): A LAN connects a small number of computers in a small area within a building or campus.
• System or storage area network (SAN): A SAN connects computers or storage devices to form a single system.
PAGE 3
• A network channel c = (x, y) is characterized by:
• width w_c: the number of parallel signals it contains,
• frequency f_c: the rate at which bits are transported on each signal,
• latency t_c: the time required for a bit to travel from x to y.
• The bandwidth of a channel is W = w_c × f_c.
• The throughput Θ of a network is the data rate, in bits per second, that the network accepts per input port.
• Under a particular traffic pattern, the channel that carries the largest fraction of the traffic determines the maximum channel load γ. The load on a channel can be equal to or smaller than the channel bandwidth.
• Θ = W / γ
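The definitions above can be sketched in a few lines of Python. This is an illustrative calculation only; the channel width, frequency, and load values below are hypothetical, not taken from the lecture.

```python
# Sketch of the slide's formulas: W = w_c * f_c and Theta = W / gamma.

def channel_bandwidth(width_signals: int, freq_hz: float) -> float:
    """W = w_c * f_c: bits per second carried by one channel."""
    return width_signals * freq_hz

def throughput_bound(channel_bw: float, max_channel_load: float) -> float:
    """Theta = W / gamma: accepted data rate per input port, limited by
    the bottleneck channel carrying the largest fraction of traffic."""
    return channel_bw / max_channel_load

# Hypothetical example: 16 parallel signals at 1 GHz; under this traffic
# pattern the bottleneck channel carries 2x the per-port injected traffic.
W = channel_bandwidth(16, 1e9)     # 16 Gbit/s
theta = throughput_bound(W, 2.0)   # 8 Gbit/s per input port
print(W, theta)
```

Note that throughput scales down as the bottleneck load γ grows, which is why routing algorithms that spread traffic across channels (page 4) matter.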
PAGE 4
• Deterministic: The simplest algorithm: for each source–destination pair there is a single path. Deterministic routing usually achieves poor performance because it fails to use alternative routes and concentrates traffic on only one set of channels.
• Oblivious: So named because it ignores the state of the network when determining a path. Unlike deterministic routing, it considers a set of paths from a source to a destination and chooses between them.
• Adaptive: The route chosen changes based on the state of the network.
PAGE 5
• Message: the logical unit of internode communication
• Packet: the basic unit containing a destination address for routing
• Packets carry a sequence number for reassembly
• Flits: the flow control digits that make up a packet
PAGE 6
• Header flits contain the routing information and sequence number
• Flit length is affected by network size
• Packet length is determined by the routing scheme and network implementation
• Lengths also depend on channel bandwidth, router design, network traffic, etc.
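The message → packet → flit hierarchy on pages 5 and 6 can be sketched as follows. The field names and sizes are hypothetical choices for illustration; real routers encode headers in hardware-specific formats.

```python
# Sketch: split a message into packets (each with destination address and
# sequence number for reassembly), then split a packet into flits, where
# the header flit carries the routing info and sequence number.

def packetize(message: bytes, packet_size: int, dest: int):
    """Split a message into packets for routing and later reassembly."""
    return [
        {"dest": dest, "seq": i, "payload": message[o:o + packet_size]}
        for i, o in enumerate(range(0, len(message), packet_size))
    ]

def flitize(packet: dict, flit_size: int):
    """Split a packet into flits; the first (header) flit carries the
    destination and sequence number, the rest carry payload."""
    header = {"dest": packet["dest"], "seq": packet["seq"]}
    p = packet["payload"]
    body = [p[o:o + flit_size] for o in range(0, len(p), flit_size)]
    return [header] + body

msg = bytes(100)                               # 100-byte message
pkts = packetize(msg, packet_size=32, dest=5)  # 4 packets (32+32+32+4 bytes)
flits = flitize(pkts[0], flit_size=8)          # 1 header flit + 4 body flits
```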
PAGE 7
• 100 Gigabit Ethernet (100GbE) and 40 Gigabit Ethernet (40GbE) are groups of computer networking technologies for transmitting Ethernet frames at rates of 100 and 40 gigabits per second (100 and 40 Gbit/s), respectively. The technology was first defined by the IEEE 802.3ba-2010 standard.
• InfiniBand (abbreviated IB) is a computer network communications link used in high-performance computing featuring very high throughput and very low latency. It is used for data interconnect both among and within computers. As of 2014 it is the most commonly used interconnect in supercomputers.
PAGE 8
• SDR: Single Data Rate, 2.5 Gb/s × 4 = 10 Gb/s
• DDR: Double Data Rate, 5 Gb/s × 4 = 20 Gb/s
• QDR: Quadruple Data Rate, 10 Gb/s × 4 = 40 Gb/s
• FDR: Fourteen Data Rate, 14 Gb/s × 4 = 56 Gb/s
• EDR: Enhanced Data Rate, 25 Gb/s × 4 = 100 Gb/s
• HDR: High Data Rate
• NDR: Next Data Rate
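The pattern in these figures, that an aggregate InfiniBand link rate is the per-lane signaling rate times the lane count (4x here), can be captured in a small sketch. Only the generations with rates given above are included.

```python
# Sketch: InfiniBand aggregate link rate = per-lane rate x lane count,
# matching the slide's figures (rates in Gbit/s, 4x links).

PER_LANE_GBPS = {"SDR": 2.5, "DDR": 5, "QDR": 10, "FDR": 14, "EDR": 25}

def link_rate(generation: str, lanes: int = 4) -> float:
    """Aggregate rate in Gbit/s for a link with the given lane count."""
    return PER_LANE_GBPS[generation] * lanes

for gen in PER_LANE_GBPS:
    print(gen, link_rate(gen))  # aggregate 4x rates: 10, 20, 40, 56, 100
```

The same function with `lanes=12` reproduces the 12x column of the table on page 11 for SDR through QDR.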
PAGE 9
• Latency is one of the factors that determines network speed.
• The term latency refers to any kind of delay typically incurred in the processing of network data.
• A low-latency connection generally experiences small delay times, while a high-latency connection generally suffers from long delays.
PAGE 10
PAGE 11
Characteristics                    SDR         DDR     QDR     FDR-10         FDR            EDR       HDR       NDR
Theoretical effective throughput,
Gbit/s, per 1x                     2           4       8       10             14             25        50
Speeds for 4x and 12x (Gbit/s)     8, 24       16, 48  32, 96  41.25, 123.75  54.54, 163.64  100, 300  200, 600
Latency (microseconds)             5           2.5     1.3     0.7            0.7            0.5
Year                               2001, 2003  2005    2007                   2011           2014      ~2017     after 2020
PAGE 12
In the context of parallel computing, granularity is the
ratio of communication time to computation time.
Fine-grain parallelism is characterized by relatively
more communication, since the computation time per task
is shorter. Coarse-grain parallelism, in turn, is
characterized by relatively less communication and
much longer computation time.
Load balance is easier to achieve with fine-grain parallelism
because small tasks depend less on the operating system,
interrupts, and so on. Coarse-grain parallelism, conversely,
makes it harder to predict when any given task will terminate,
therefore making it harder to assign tasks for optimal usage of
the multiple processors.
Fine-grain parallelism requires more synchronization overhead
due to the need to communicate data and synchronize tasks
among processors. The reduced communication in coarse-grain
parallelism therefore reduces overhead.
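Using the slide's definition of granularity, the fine-grain versus coarse-grain contrast can be put in numbers. The timings below are hypothetical, chosen only to illustrate the ratio.

```python
# Sketch: granularity as the ratio of communication time to computation
# time, per the slide's definition. Timings are hypothetical.

def granularity(comm_time: float, comp_time: float) -> float:
    return comm_time / comp_time

# Fine grain: short per-task computation, so communication dominates more.
fine = granularity(comm_time=1.0, comp_time=2.0)      # 0.5
# Coarse grain: long per-task computation, so communication matters less.
coarse = granularity(comm_time=1.0, comp_time=100.0)  # 0.01
print(fine, coarse)
```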
PAGE 13
• Greedy: Always send the packet in the shorter direction around the ring. For example, always route from 0 to 3 in the clockwise direction and from 0 to 5 in the counterclockwise direction. If the distance is the same in both directions, pick a direction randomly.
• Uniform random: Randomly pick a direction for each packet, with equal probability of picking either direction.
• Weighted random: Randomly pick a direction for each packet, but weight the short direction with probability 1 - Δ/N and the long direction with Δ/N, where Δ is the (minimum) distance between the source and destination and N is the number of links in the ring.
• Adaptive: Send the packet in the direction for which the local channel has the lowest load. We may approximate load by either measuring the length of the queue serving this channel or recording how many packets it has transmitted over the last T slots.
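Two of the policies above (greedy and weighted random) can be sketched for an N-node ring. The 8-node ring below matches the 0→3 clockwise / 0→5 counterclockwise example; direction encoding (+1 clockwise, -1 counterclockwise) is an assumption for illustration.

```python
# Sketch: direction choice on an N-node ring for the greedy and
# weighted-random routing policies described above.
import random

def distances(src: int, dst: int, n: int):
    cw = (dst - src) % n   # hops going clockwise
    ccw = (src - dst) % n  # hops going counterclockwise
    return cw, ccw

def greedy(src: int, dst: int, n: int) -> int:
    """Always take the shorter way around; break ties randomly."""
    cw, ccw = distances(src, dst, n)
    if cw < ccw:
        return +1
    if ccw < cw:
        return -1
    return random.choice([+1, -1])

def weighted_random(src: int, dst: int, n: int) -> int:
    """Short direction with probability 1 - D/N, long with D/N,
    where D is the minimum distance and N the number of links."""
    cw, ccw = distances(src, dst, n)
    d = min(cw, ccw)
    short = +1 if cw <= ccw else -1
    return short if random.random() < 1 - d / n else -short

# On an 8-node ring: 0 -> 3 is 3 hops clockwise, 5 counterclockwise.
print(greedy(0, 3, 8))  # 1 (clockwise), as in the slide's example
```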
PAGE 14
• Circuit switching
• A "circuit" path is established a priori and torn down after use
• Routing, arbitration, and switching are performed once for a train of packets
• Reduces latency and overhead
• Can be highly wasteful of scarce network bandwidth
• Links and switches go underutilized
• during path establishment and tear-down
• if no train of packets follows circuit set-up
PAGE 15
[Figure: source end node and destination end node connected through switches; buffers hold "request" tokens]
PAGE 16
• Request for circuit establishment (routing and arbitration are performed during this step)
PAGE 17
• Acknowledgment and circuit establishment (buffers hold "ack" tokens; as the token travels back to the source, connections are established)
PAGE 18
• Packet transport (neither routing nor arbitration is required)
PAGE 19
• High contention, low utilization, hence low throughput
PAGE 20
• Routing, arbitration, and switching are performed on a per-packet basis
• Sharing of network link bandwidth is done on a per-packet basis
• More efficient sharing and use of network bandwidth by multiple flows when individual sources transmit packets more intermittently
• Store-and-forward switching
• Bits of a packet are forwarded only after the entire packet is first stored
• Packet transmission delay is multiplicative with hop count, d
• Cut-through switching
• Bits of a packet are forwarded once the header portion is received
• Packet transmission delay is additive with hop count, d
• Virtual cut-through: flow control is applied at the packet level
• Wormhole: flow control is applied at the flow unit (flit) level
PAGE 21
• Store-and-forward: packets are completely stored at a switch before any portion is forwarded
[Figure: source and destination end nodes; buffers for data packets at each switch along the path]
PAGE 22
• Store-and-forward requirement: buffers must be sized to hold an entire packet (MTU)
• Cut-through: portions of a packet may be forwarded ("cut through") to the next switch before the entire packet is stored at the current switch
• Virtual cut-through: buffers must still be sized to hold an entire packet (MTU); when a link is busy, the packet is completely stored at the switch
• Wormhole: buffers are sized for flits, so packets can be larger than the buffers; when a link is busy, the packet remains stored along the path
• Maximizing the sharing of link bandwidth increases utilization