CLUSTERS in Distributed Systems
Mariam A. Salih
What is a cluster?
Cluster classes
Cluster interconnection network
Cluster Architecture
Dedicated clusters
Cluster middleware
Single System Image
Myrinet Clos Network
CLUSTER EXAMPLES
What is a cluster?
A cluster is a collection of stand-alone computers connected by some interconnection network. Each node in a cluster could be a workstation, a personal computer, or even a multiprocessor system.
A node in the cluster is an autonomous computer that may be engaged in its own private activities while at the same time cooperating with other units in the context of some computational task. Each node has its own input/output system and its own operating system.
Cluster classes
When all nodes in a cluster have the same architecture and run the same operating system, the cluster is called homogeneous; otherwise, it is heterogeneous.
Cluster interconnection network characteristics:
The interconnection network could be a fast LAN or a switch.
To achieve high-performance computing, the interconnection network must provide high-bandwidth and low-latency communication.
The nodes of a cluster may be dedicated to the cluster all the time; hence computation can be performed on the entire cluster.
Cluster Architecture
What are dedicated clusters?
Dedicated clusters are normally packaged compactly in a single room. With the exception of the front-end node, all nodes are headless, with no keyboard, mouse, or monitor.
Dedicated clusters usually use high-speed networks such as Fast Ethernet and Myrinet.
Cluster middleware
The middleware layer in the architecture makes the cluster appear to the user as a single parallel machine, which is referred to as the single system image (SSI).
The middleware also supports features that enable cluster services such as recovery from failure and fault tolerance among all nodes of the cluster.
For example, the middleware should offer the necessary infrastructure for check-pointing. A check-pointing scheme makes sure that the process state is saved periodically. In the case of a node failure, processes on the failed node can be restarted on another working node.
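To make the check-pointing idea concrete, here is a minimal application-level sketch in C; the state structure, file name, and checkpoint interval are illustrative assumptions rather than part of any particular middleware.

```c
#include <stdio.h>

/* Illustrative process state; real middleware would capture much more
   (registers, open files, communication state). */
typedef struct {
    long iteration;     /* how far the computation has progressed */
    double partial_sum; /* intermediate result worth preserving   */
} State;

/* Write the state to stable storage so a restarted process can resume. */
static void save_checkpoint(const State *s) {
    FILE *f = fopen("checkpoint.dat", "wb");   /* hypothetical file name */
    if (f) { fwrite(s, sizeof *s, 1, f); fclose(f); }
}

/* Try to resume from an earlier checkpoint; returns 0 if none exists. */
static int restore_checkpoint(State *s) {
    FILE *f = fopen("checkpoint.dat", "rb");
    if (!f) return 0;
    int ok = (fread(s, sizeof *s, 1, f) == 1);
    fclose(f);
    return ok;
}

int main(void) {
    State s = {0, 0.0};
    if (restore_checkpoint(&s))
        printf("restarting from iteration %ld\n", s.iteration);

    for (; s.iteration < 1000000; s.iteration++) {
        s.partial_sum += 1.0 / (s.iteration + 1);   /* the actual computation */
        if (s.iteration % 100000 == 0)              /* periodic checkpoint    */
            save_checkpoint(&s);
    }
    printf("result = %f\n", s.partial_sum);
    return 0;
}
```

If the node fails, the middleware can restart the program on another working node, where restore_checkpoint resumes from the last saved state instead of recomputing from the beginning.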
Middleware Design Goals
Complete Transparency (Manageability).
Scalable Performance.
Enhanced Availability.
Complete Transparency
(Manageability)
The middleware must allow the user to use
a cluster easily and effectively without the
knowledge of the underlying system
architecture. This allows the user to access
system resources such as memory,
processors, and the network
transparently, irrespective of whether they
are available locally or remotely.
Scalable Performance
As clusters can easily be expanded, their
performance should scale as well. This
scalability should happen without the
need for new protocols.
To extract the maximum performance,
the middleware service must support load
balancing and parallelism by distributing
workload evenly among nodes.
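As a small illustration of distributing workload evenly among nodes, the sketch below divides a fixed number of work items into near-equal contiguous blocks, one per node; the item count and node ranks are assumed values, and real middleware would also balance load dynamically.

```c
#include <stdio.h>

/* Split `total` work items as evenly as possible across `nodes` workers.
   Node `rank` gets the contiguous range [start, end); the first
   `total % nodes` nodes each take one extra item. Purely illustrative. */
static void block_range(long total, int nodes, int rank, long *start, long *end) {
    long base  = total / nodes;
    long extra = total % nodes;
    *start = rank * base + (rank < extra ? rank : extra);
    *end   = *start + base + (rank < extra ? 1 : 0);
}

int main(void) {
    long total = 1000;   /* assumed workload size   */
    int  nodes = 7;      /* assumed cluster size    */
    for (int r = 0; r < nodes; r++) {
        long s, e;
        block_range(total, nodes, r, &s, &e);
        printf("node %d handles items [%ld, %ld)\n", r, s, e);
    }
    return 0;
}
```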
Enhanced Availability
The middleware services must be highly available at all times. At any time, a point of failure should be recoverable without affecting a user's application.
When middleware services are offered using the resources available on multiple nodes, the failure of any node should not affect the system's operation.
What is Single System Image (SSI)?
SSI is the illusion, created by software or hardware, that hides the heterogeneous and distributed nature of the available computing resources and presents them as a single unified computing resource.
SSI makes the cluster appear like a single
machine to the user, to applications, and to
the network.
Benefits of SSI
Transparent use of system resources.
Improved reliability and higher availability.
Improved system response time and performance.
Simplified system management.
Reduction in the risk of operator errors.
No need to be aware of the underlying system architecture to use the machines effectively.
SSI Availability Support Functions
Single I/O space: any node can access any peripheral or disk device without knowledge of its physical location.
Single process space: any process on any node can create processes with cluster-wide process identifiers, and processes communicate through signals, pipes, etc., as if they were running on a single node.
Check-pointing and process migration: the process state and intermediate results can be saved periodically, in memory or on disk, to support rollback recovery when a node fails; process migration also lets the resource management system (RMS) balance the load across nodes.
SSI Levels
SSI levels of abstraction:
Application Level
Operating System Kernel Level
Hardware Level
Hardware level
Systems such as Digital's Memory Channel and hardware distributed shared memory offer SSI at the hardware level and allow the user to view the cluster as a shared-memory system.
Operating System Kernel Level
The operating system must support gang scheduling of
parallel programs, identify idle resources in the system
(such as processors, memory, and networks), and offer
globalized access to them.
It should optimally support process migration to provide
dynamic load balancing as well as fast inter-process
communication for both the system- and user-level
applications.
The OS must make sure these features are available to
the user without the need for additional system calls or
commands.
Application Level
The application-level SSI is the highest
and, in a sense, most important because
this is what the end user sees.
At this level, multiple cooperative
components of an application are
presented to the user as a single
application.
Myrinet Clos Network
Myrinet is a high-performance, packet-
communication and switching technology.
The basic building block of the Myrinet-2000 network is a 16-port Myrinet crossbar switch, implemented on a single chip designated the Xbar16. These switches can be interconnected to build various topologies of varying sizes.
Myrinet Clos Network design
The upper-row switches form the Clos network spine, which is connected through a Clos spreader network to the leaf switches forming the lower row.
[Figure: a 128-host Clos network]
[Figure: a 64-host Clos network using 16-port Myrinet-2000 switches; each line represents two links]
[Figure: a 32-host Clos network using 16-port Myrinet switches; each line represents four links]
Routing in Myrinet
The routing of Myrinet packets is based on the
source routing approach.
Each Myrinet packet has a variable length
header with complete routing information.
When a packet enters a switch, the leading byte of the header determines the outgoing port and is then stripped from the header.
At the host interface, a control program is
executed to perform source-route translation.
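The sketch below is a toy C model of this source-routing behaviour: the header is a list of per-hop output ports, and each switch consumes and strips the leading byte. The packet layout and port numbers are invented for illustration and are not the actual Myrinet wire format.

```c
#include <stdio.h>
#include <string.h>

/* Toy model of source routing: the source writes one output-port byte per
   switch on the path; each switch reads the leading byte, forwards the
   packet on that port, and strips the byte from the header. */
typedef struct {
    unsigned char route[8];  /* remaining per-hop output ports */
    int hops_left;
    const char *payload;
} Packet;

static void switch_forward(Packet *p, int switch_id) {
    int out_port = p->route[0];                      /* leading header byte */
    memmove(p->route, p->route + 1, --p->hops_left); /* strip it            */
    printf("switch %d: forwarding on port %d\n", switch_id, out_port);
}

int main(void) {
    Packet p = { {3, 1, 6}, 3, "hello" };   /* source chose the full path */
    for (int sw = 0; p.hops_left > 0; sw++)
        switch_forward(&p, sw);
    printf("delivered payload: %s\n", p.payload);
    return 0;
}
```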
CLUSTER EXAMPLES
Berkeley Network of Workstations (NOW)
The Beowulf Cluster
Berkeley Network of Workstations
(NOW)
In 1997, the NOW project achieved over 10 Gflops on the Linpack benchmark, which made it one of the top 200 fastest supercomputers in the world.
The hardware/software infrastructure for the project included 100 SUN Ultrasparcs and 40 SUN Sparcstations running Solaris, 35 Intel PCs running Windows NT or a PC Unix variant, and between 500 and 1000 disks, all connected by a Myrinet switched network.
Berkeley Network of Workstations (NOW)
The programming environments used in NOW are sockets, MPI, and a parallel version of C called Split-C.
Active Messages is the basic communication
primitive in Berkeley NOW.
The Active Messages communication is a
simplified remote procedure call that can be
implemented efficiently on a wide range of
hardware.
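A minimal sketch of the Active Messages idea is shown below: each message names a handler that the receiving node invokes on arrival, much like a lightweight remote procedure call. The handler table and message layout here are assumptions for illustration, not the Berkeley NOW API.

```c
#include <stdio.h>

/* An active message carries a handler index plus a small argument;
   the receiver simply invokes that handler when the message arrives. */
typedef struct {
    int handler;   /* index into the receiver's handler table */
    int arg;       /* argument carried in the message         */
} ActiveMessage;

typedef void (*am_handler)(int arg);

static void handle_ack(int arg)   { printf("ack for request %d\n", arg); }
static void handle_store(int arg) { printf("store value %d\n", arg); }

static am_handler handler_table[] = { handle_ack, handle_store };

/* On a real network this would run in the message arrival path. */
static void am_deliver(const ActiveMessage *m) {
    handler_table[m->handler](m->arg);
}

int main(void) {
    ActiveMessage m1 = {0, 42};   /* "ack" message   */
    ActiveMessage m2 = {1, 7};    /* "store" message */
    am_deliver(&m1);
    am_deliver(&m2);
    return 0;
}
```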
The Beowulf Cluster
The idea of the Beowulf cluster project
was to achieve supercomputer
processing power using off-the-shelf
commodity machines.
History of The Beowulf Cluster
One of the earliest Beowulf clusters contained sixteen
100 MHz DX4 processors that were connected using 10
Mbps Ethernet.
The second Beowulf cluster, built in 1995, used 100 MHz
Pentium processors connected by 100 Mbps Ethernet.
The third generation of Beowulf clusters was built by
different research laboratories. JPL and Los Alamos
National Laboratory each built a 16-processor machine
incorporating Pentium Pro processors. These machines were combined to run a large N-body problem, which won the 1997 Gordon Bell Prize for price/performance.
Beowulf system
In a Beowulf system, the application programs never see the computational nodes (also called slave computers); they interact only with the "master", a specific computer that handles the scheduling and management of the slaves.
The slave computers typically have their own copies of the operating system, as well as local memory and disk space.
The private slave network may also have a large shared file server that stores global persistent data, which the slaves access as needed.
Communication in a Beowulf cluster
Communication between processors in Beowulf is done through TCP/IP over the Ethernet internal to the cluster.
Multiple Ethernets were also used to satisfy higher
bandwidth requirements.
Channel bonding is a technique to connect
multiple Ethernets in order to distribute the
communication traffic. Channel bonding was able
to increase the sustained network throughput by
75% when dual networks were used.
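As a toy illustration of the idea behind channel bonding, the sketch below stripes outgoing frames round-robin across two links; real channel bonding is implemented in the operating system's network driver, not in application code, and the interface names are invented.

```c
#include <stdio.h>

#define NUM_LINKS 2   /* assumed dual Ethernet networks */

/* Stands in for transmitting one frame on a given physical link. */
static void send_on_link(int link, int frame) {
    printf("frame %d -> eth%d\n", frame, link);
}

int main(void) {
    /* Alternate between the two links so neither becomes the bottleneck. */
    for (int frame = 0; frame < 8; frame++)
        send_on_link(frame % NUM_LINKS, frame);
    return 0;
}
```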
Simple Quiz..