CLUSTERS in Distributed Systems
Mariam A. Salih
What is a cluster?
Cluster classes
Cluster interconnection network
Cluster Architecture
Dedicated clusters
Cluster middleware
Single System Image
Myrinet Clos Network
CLUSTER EXAMPLES
What is a cluster?
A cluster is a collection of stand-alone computers connected by some interconnection network. Each node in a cluster could be a workstation, a personal computer, or even a multiprocessor system.
A node in the cluster is an autonomous computer that may be engaged in its own private activities while at the same time cooperating with other units in the context of some computational task. Each node has its own input/output system and its own operating system.
Cluster classes
When all nodes in a cluster have the same architecture and run the same operating system, the cluster is called homogeneous; otherwise, it is heterogeneous.
Cluster interconnection network characteristics:
The interconnection network could be a fast LAN or a switch.
To achieve high-performance computing, the interconnection network must provide high-bandwidth and low-latency communication.
The nodes of a cluster may be dedicated to the cluster all the time; hence computation can be performed on the entire cluster.
Cluster Architecture
What are dedicated clusters?
Dedicated clusters are normally packaged compactly in a single room. With the exception of the front-end node, all nodes are headless, with no keyboard, mouse, or monitor.
Dedicated clusters usually use high-speed networks such as Fast Ethernet and Myrinet.
Cluster middleware
The middleware layer in the architecture makes the cluster appear to the user as a single parallel machine, which is referred to as the single system image (SSI).
The middleware also supports features that enable cluster services such as recovery from failure and fault tolerance among all nodes of the cluster.
For example, the middleware should offer the necessary infrastructure for check-pointing. A check-pointing scheme makes sure that the process state is saved periodically. In the case of a node failure, processes on the failed node can be restarted on another working node.
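To make the check-pointing idea concrete, here is a minimal application-level sketch in C; the state structure, file name, and checkpoint interval are illustrative assumptions rather than part of any particular middleware.

```c
#include <stdio.h>

/* Illustrative process state; real middleware would capture much more
   (registers, open files, communication state). */
typedef struct {
    long iteration;     /* how far the computation has progressed */
    double partial_sum; /* intermediate result worth preserving   */
} State;

/* Write the state to stable storage so a restarted process can resume. */
static void save_checkpoint(const State *s) {
    FILE *f = fopen("checkpoint.dat", "wb");   /* hypothetical file name */
    if (f) { fwrite(s, sizeof *s, 1, f); fclose(f); }
}

/* Try to resume from an earlier checkpoint; returns 0 if none exists. */
static int restore_checkpoint(State *s) {
    FILE *f = fopen("checkpoint.dat", "rb");
    if (!f) return 0;
    int ok = (fread(s, sizeof *s, 1, f) == 1);
    fclose(f);
    return ok;
}

int main(void) {
    State s = {0, 0.0};
    if (restore_checkpoint(&s))
        printf("restarting from iteration %ld\n", s.iteration);

    for (; s.iteration < 1000000; s.iteration++) {
        s.partial_sum += 1.0 / (s.iteration + 1);   /* the actual computation */
        if (s.iteration % 100000 == 0)              /* periodic checkpoint    */
            save_checkpoint(&s);
    }
    printf("result = %f\n", s.partial_sum);
    return 0;
}
```

If the node fails, the middleware can restart the program on another working node, where restore_checkpoint resumes from the last saved state instead of recomputing from the beginning.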
Middleware Design Goals
Complete Transparency (Manageability).
Scalable Performance.
Enhanced Availability.
Complete Transparency
(Manageability)
The middleware must allow the user to use
a cluster easily and effectively without the
knowledge of the underlying system
architecture. This allows the user to access
system resources such as memory,
processors, and the network
transparently, irrespective of whether they
are available locally or remotely.
Scalable Performance
As clusters can easily be expanded, their
performance should scale as well. This
scalability should happen without the
need for new protocols.
To extract the maximum performance,
the middleware service must support load
balancing and parallelism by distributing
workload evenly among nodes.
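As a small illustration of distributing workload evenly among nodes, the sketch below divides a fixed number of work items into near-equal contiguous blocks, one per node; the item count and node ranks are assumed values, and real middleware would also balance load dynamically.

```c
#include <stdio.h>

/* Split `total` work items as evenly as possible across `nodes` workers.
   Node `rank` gets the contiguous range [start, end); the first
   `total % nodes` nodes each take one extra item. Purely illustrative. */
static void block_range(long total, int nodes, int rank, long *start, long *end) {
    long base  = total / nodes;
    long extra = total % nodes;
    *start = rank * base + (rank < extra ? rank : extra);
    *end   = *start + base + (rank < extra ? 1 : 0);
}

int main(void) {
    long total = 1000;   /* assumed workload size   */
    int  nodes = 7;      /* assumed cluster size    */
    for (int r = 0; r < nodes; r++) {
        long s, e;
        block_range(total, nodes, r, &s, &e);
        printf("node %d handles items [%ld, %ld)\n", r, s, e);
    }
    return 0;
}
```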
Enhanced Availability
The middleware services must be highly available at all times. At any time, a point of failure should be recoverable without affecting a user's application.
When middleware services are offered using the resources available on multiple nodes, the failure of any node should not affect the system's operation.
What is Single System Image (SSI)?
SSI is the illusion, created by software or hardware, that hides the heterogeneous and distributed nature of the available computing resources and presents them as a single unified computing resource.
SSI makes the cluster appear like a single
machine to the user, to applications, and to
the network.
Benefits of SSI
Transparent use of system resources.
Improved reliability and higher availability.
Improved system response time and performance.
Simplified system management.
Reduction in the risk of operator errors.
No need to be aware of the underlying system architecture to use the machines effectively.
SSI Availability Support Functions
Single I/O space: any node can access any peripheral or disk device without knowledge of its physical location.
Single process space: any process on any node can create processes with cluster-wide process identifiers, and processes communicate through signals, pipes, etc., as if they were running on a single node.
Check-pointing and process migration: the process state and intermediate results can be saved periodically, in memory or on disk, to support rollback recovery when a node fails; process migration also lets the resource management system (RMS) balance the load across nodes.
SSI Levels
SSI levels of abstraction:
Application Level
Operating System Kernel Level
Hardware Level
Hardware level
Systems such as Digital's Memory Channel and hardware distributed shared memory offer SSI at the hardware level and allow the user to view the cluster as a shared-memory system.
Operating System Kernel Level
The operating system must support gang scheduling of
parallel programs, identify idle resources in the system
(such as processors, memory, and networks), and offer
globalized access to them.
It should optimally support process migration to provide
dynamic load balancing as well as fast inter-process
communication for both the system- and user-level
applications.
The OS must make sure these features are available to
the user without the need for additional system calls or
commands.
Application Level
The application-level SSI is the highest
and, in a sense, most important because
this is what the end user sees.
At this level, multiple cooperative
components of an application are
presented to the user as a single
application.
Myrinet Clos Network
Myrinet is a high-performance, packet-
communication and switching technology.
The basic building block of the Myrinet-2000 network is a 16-port Myrinet crossbar switch, implemented on a single chip designated the Xbar16. These switches can be interconnected to build various topologies of varying sizes.
Myrinet Clos Network design
The upper-row switches form the Clos network spine, which is connected through a Clos spreader network to the leaf switches forming the lower row.
[Figure: a 128-host Clos network]
[Figure: a 64-host Clos network using 16-port Myrinet-2000 switches; each line represents two links]
[Figure: a 32-host Clos network using 16-port Myrinet switches; each line represents four links]
Routing in Myrinet
The routing of Myrinet packets is based on the
source routing approach.
Each Myrinet packet has a variable length
header with complete routing information.
When a packet enters a switch, the leading byte of the header determines the outgoing port and is then stripped from the header.
At the host interface, a control program is
executed to perform source-route translation.
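The sketch below is a toy C model of this source-routing behaviour: the header is a list of per-hop output ports, and each switch consumes and strips the leading byte. The packet layout and port numbers are invented for illustration and are not the actual Myrinet wire format.

```c
#include <stdio.h>
#include <string.h>

/* Toy model of source routing: the source writes one output-port byte per
   switch on the path; each switch reads the leading byte, forwards the
   packet on that port, and strips the byte from the header. */
typedef struct {
    unsigned char route[8];  /* remaining per-hop output ports */
    int hops_left;
    const char *payload;
} Packet;

static void switch_forward(Packet *p, int switch_id) {
    int out_port = p->route[0];                      /* leading header byte */
    memmove(p->route, p->route + 1, --p->hops_left); /* strip it            */
    printf("switch %d: forwarding on port %d\n", switch_id, out_port);
}

int main(void) {
    Packet p = { {3, 1, 6}, 3, "hello" };   /* source chose the full path */
    for (int sw = 0; p.hops_left > 0; sw++)
        switch_forward(&p, sw);
    printf("delivered payload: %s\n", p.payload);
    return 0;
}
```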
CLUSTER EXAMPLES
Berkeley Network of Workstations (NOW)
The Beowulf Cluster
Berkeley Network of Workstations
(NOW)
In 1997, the NOW project achieved over 10 Gflops on the Linpack benchmark, which made it one of the top 200 fastest supercomputers in the world.
The hardware/software infrastructure for the project included 100 SUN Ultrasparcs and 40 SUN Sparcstations running Solaris, 35 Intel PCs running Windows NT or a PC Unix variant, and between 500 and 1000 disks, all connected by a Myrinet switched network.
Berkeley Network of Workstations (NOW)
The programming environments used in NOW are sockets, MPI, and a parallel version of C called Split-C.
Active Messages is the basic communication
primitive in Berkeley NOW.
The Active Messages communication is a
simplified remote procedure call that can be
implemented efficiently on a wide range of
hardware.
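A minimal sketch of the Active Messages idea is shown below: each message names a handler that the receiving node invokes on arrival, much like a lightweight remote procedure call. The handler table and message layout here are assumptions for illustration, not the Berkeley NOW API.

```c
#include <stdio.h>

/* An active message carries a handler index plus a small argument;
   the receiver simply invokes that handler when the message arrives. */
typedef struct {
    int handler;   /* index into the receiver's handler table */
    int arg;       /* argument carried in the message         */
} ActiveMessage;

typedef void (*am_handler)(int arg);

static void handle_ack(int arg)   { printf("ack for request %d\n", arg); }
static void handle_store(int arg) { printf("store value %d\n", arg); }

static am_handler handler_table[] = { handle_ack, handle_store };

/* On a real network this would run in the message arrival path. */
static void am_deliver(const ActiveMessage *m) {
    handler_table[m->handler](m->arg);
}

int main(void) {
    ActiveMessage m1 = {0, 42};   /* "ack" message   */
    ActiveMessage m2 = {1, 7};    /* "store" message */
    am_deliver(&m1);
    am_deliver(&m2);
    return 0;
}
```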
The Beowulf Cluster
The idea of the Beowulf cluster project
was to achieve supercomputer
processing power using off-the-shelf
commodity machines.
History of The Beowulf Cluster
One of the earliest Beowulf clusters contained sixteen
100 MHz DX4 processors that were connected using 10
Mbps Ethernet.
The second Beowulf cluster, built in 1995, used 100 MHz
Pentium processors connected by 100 Mbps Ethernet.
The third generation of Beowulf clusters was built by
different research laboratories. JPL and Los Alamos
National Laboratory each built a 16-processor machine
incorporating Pentium Pro processors. These machines were combined to run a large N-body problem, which won the 1997 Gordon Bell Prize for price/performance.
Beowulf system
In a Beowulf system, the application programs never see the computational nodes (also called slave computers); they interact only with the "master", a specific computer that handles the scheduling and management of the slaves.
The slave computers typically have their own copies of the operating system, as well as local memory and disk space.
The private slave network may also have a large shared file server that stores global persistent data, which the slaves access as needed.
Communication in a Beowulf cluster
Communication between processors in Beowulf is done through TCP/IP over the Ethernet internal to the cluster.
Multiple Ethernets were also used to satisfy higher
bandwidth requirements.
Channel bonding is a technique to connect
multiple Ethernets in order to distribute the
communication traffic. Channel bonding was able
to increase the sustained network throughput by
75% when dual networks were used.
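As a toy illustration of the idea behind channel bonding, the sketch below stripes outgoing frames round-robin across two links; real channel bonding is implemented in the operating system's network driver, not in application code, and the interface names are invented.

```c
#include <stdio.h>

#define NUM_LINKS 2   /* assumed dual Ethernet networks */

/* Stands in for transmitting one frame on a given physical link. */
static void send_on_link(int link, int frame) {
    printf("frame %d -> eth%d\n", frame, link);
}

int main(void) {
    /* Alternate between the two links so neither becomes the bottleneck. */
    for (int frame = 0; frame < 8; frame++)
        send_on_link(frame % NUM_LINKS, frame);
    return 0;
}
```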
Simple Quiz..