N. Xiong@ GSU Chapter 05 Clustered Systems for Massive Parallelism N. Xiong Georgia State University

Upload: darren-powers

Post on 03-Jan-2016

N. Xiong@ GSU Slide 1

Chapter 05

Clustered Systems for Massive Parallelism

N. Xiong

Georgia State University

N. Xiong@ GSU Slide 2

Review and Introduction

N. Xiong@ GSU Slide 3

Chapter 05 Main Contents

Design Objectives of Clusters and MPPs
Cluster and MPP System Architectures
Design Principles of Clustered Systems
Multiple Job Scheduling and Management
Virtual Clustering and Resource Provisioning
Homework Problems

N. Xiong@ GSU Slide 4

Design Objectives of Clustered Systems

Scalability
Packaging
Control
Homogeneity
Security

N. Xiong@ GSU Slide 5

Design Objectives of Clustered Systems

N. Xiong@ GSU Slide 6

Fundamental Cluster Design Issues

Scalable Performance
Single System Image
Availability Support
Cluster Job Management
Internode Communication
Fault Tolerance and Recovery
Growth of Servers in HPC and HTC Systems

N. Xiong@ GSU Slide 7

Resource-Sharing in Cluster Systems

N. Xiong@ GSU Slide 8

An Idealized Cluster Architecture

Conventional databases and OLTP monitors offer users a desktop environment

Supports parallel programming based on standard languages and communication libraries

A user-interface subsystem combines the advantages of the web interface and the Windows GUI

N. Xiong@ GSU Slide 9

Node Architectures and System Packaging

Two types of cluster nodes:
compute nodes
service nodes

N. Xiong@ GSU Slide 10

Compute Node Examples

N. Xiong@ GSU Slide 11

Modular Packaging of IBM BlueGene/L System

N. Xiong@ GSU Slide 12

Cluster System Interconnects

N. Xiong@ GSU Slide 13

High-Bandwidth Interconnects

N. Xiong@ GSU Slide 14

An InfiniBand Cluster Interconnection Network

N. Xiong@ GSU Slide 15

High-bandwidth Interconnects in Top-500 Systems

N. Xiong@ GSU Slide 16

Hardware, Software, and Middleware Support

N. Xiong@ GSU Slide 17

Design Principles of Clusters

Single-System-Image (SSI) Features
Single System
Single Control
Symmetry
Location Transparent

N. Xiong@ GSU Slide 18

Design Principles of Clusters

Single-System-Image Layers
Application Software Layer
Hardware or Kernel Layer
Middleware Layer

N. Xiong@ GSU Slide 19

Design Principles of Clusters

Single-System-Image Composition
Single Entry Point
Single File Hierarchy
Single I/O, Networking, and Memory Space
Other Desired SSI Features

N. Xiong@ GSU Slide 20

Single Entry Point

N. Xiong@ GSU Slide 21

Single File Hierarchy

It is persistent.
It is fault tolerant to some degree.
Examples: Network File System (NFS) and Andrew File System (AFS).

N. Xiong@ GSU Slide 22

Single File Hierarchy

N. Xiong@ GSU Slide 23

Single I/O, Networking, and Memory Space

Single Input/Output
Single Networking
Single Point of Control
Single Memory Space

N. Xiong@ GSU Slide 24

Single I/O, Networking, and Memory Space

N. Xiong@ GSU Slide 25

An Example

N. Xiong@ GSU Slide 26

Other Desired SSI Features

Single Job Management System
Single User Interface
Single Process Space

N. Xiong@ GSU Slide 27

Middleware Support for SSI Clustering

N. Xiong@ GSU Slide 28

High Availability Through Redundancy

Reliability
Availability
Serviceability

N. Xiong@ GSU Slide 29

Availability and Failure Rate
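
The transcript keeps only this slide's title, so the following is a minimal sketch rather than the slide's own content. It uses the standard steady-state definition, availability = MTTF / (MTTF + MTTR), where MTTF is the mean time to failure and MTTR is the mean time to repair. The function names and the hours-per-year default are assumptions chosen for illustration.

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability = MTTF / (MTTF + MTTR).

    mttf_hours: mean time to failure (must be positive).
    mttr_hours: mean time to repair (must be non-negative).
    """
    if mttf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTTF must be > 0 and MTTR must be >= 0")
    return mttf_hours / (mttf_hours + mttr_hours)


def downtime_hours_per_year(avail: float, hours_per_year: float = 365 * 24) -> float:
    """Expected downtime per year for a given availability (assumes a 365-day year)."""
    if not 0.0 <= avail <= 1.0:
        raise ValueError("availability must lie in [0, 1]")
    return (1.0 - avail) * hours_per_year
```

Under this definition, a shorter repair time raises availability just as a longer time between failures does, which is why redundancy and fast failover both matter.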

N. Xiong@ GSU Slide 30

Availability Values of Several Representative Systems

N. Xiong@ GSU Slide 31

Redundancy Techniques

N. Xiong@ GSU Slide 32

Fault-Tolerant Cluster Configurations

Hot Standby
Mutual Takeover
Fault-Tolerance

N. Xiong@ GSU Slide 33

Recovery Schemes

Backward recovery
Forward recovery: in real-time systems

N. Xiong@ GSU Slide 34

Checkpointing and Recovery Techniques

Kernel, Library, and Application Levels
Checkpoint Overheads
Choosing an Optimal Checkpoint Interval
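
The transcript does not preserve the formula this slide used for the optimal checkpoint interval. As a stand-in, the sketch below uses Young's well-known first-order approximation, interval ≈ sqrt(2 × C × MTBF), where C is the time to write one checkpoint and MTBF is the mean time between failures. It comes from minimizing the approximate overhead C/T + T/(2 × MTBF): the first term is the cost of checkpointing every T time units, the second is the expected rework after a failure. All names here are hypothetical.

```python
import math


def overhead_fraction(interval: float, checkpoint_cost: float, mtbf: float) -> float:
    """First-order wasted-time fraction: checkpoint cost plus expected rework.

    Assumes failures are rare relative to the interval (interval << mtbf).
    """
    if interval <= 0 or checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("all arguments must be positive")
    return checkpoint_cost / interval + interval / (2.0 * mtbf)


def young_checkpoint_interval(checkpoint_cost: float, mtbf: float) -> float:
    """Young's approximation for the interval minimizing overhead_fraction."""
    if checkpoint_cost <= 0 or mtbf <= 0:
        raise ValueError("checkpoint_cost and mtbf must be positive")
    return math.sqrt(2.0 * checkpoint_cost * mtbf)
```

Checkpointing more often than this interval spends too much time saving state. Checkpointing less often loses too much work per failure.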

N. Xiong@ GSU Slide 35

Checkpointing Parallel Programs

N. Xiong@ GSU Slide 36

Cluster Job Scheduling and Management

Cluster Job Management Issues
A user server
A job scheduler
A resource manager

N. Xiong@ GSU Slide 37

Cluster Job Types

Serial jobs
Parallel jobs
Interactive jobs
Batch jobs
Foreign jobs

N. Xiong@ GSU Slide 38

Multi-Job Scheduling Schemes

N. Xiong@ GSU Slide 39

Share Cluster Nodes

Dedicated Mode
Space Sharing
Time Sharing

N. Xiong@ GSU Slide 40

Migration Scheme Issues

Node Availability
Migration Overhead
Recruitment Threshold: the amount of time a workstation stays unused before the cluster considers it an idle node
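
The recruitment threshold defined above can be sketched as a simple idle-time check. This is an illustrative sketch only: the `Workstation` record, the `idle_nodes` helper, and the choice of `>=` for the comparison are assumptions, not part of the slides.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Workstation:
    name: str
    last_activity: float  # time of last owner activity, in seconds (hypothetical field)


def idle_nodes(
    workstations: Iterable[Workstation],
    now: float,
    recruitment_threshold: float,
) -> List[str]:
    """Return names of workstations unused for at least the recruitment threshold.

    A workstation counts as idle once (now - last_activity) reaches the
    threshold; until then the cluster does not recruit it for foreign jobs.
    """
    if recruitment_threshold < 0:
        raise ValueError("recruitment_threshold must be non-negative")
    return [
        w.name
        for w in workstations
        if now - w.last_activity >= recruitment_threshold
    ]
```

A low threshold recruits nodes sooner but risks migrating jobs away again when the owner returns. A high threshold leaves idle capacity unused for longer.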

N. Xiong@ GSU Slide 41

Virtual Clustering and Resource Provisioning

N. Xiong@ GSU Slide 42

Five Virtual Cluster Research Projects

N. Xiong@ GSU Slide 43

Live VM Migration and Cluster Management

N. Xiong@ GSU Slide 44

Effect by Live Migration

N. Xiong@ GSU Slide 45

Dynamic Virtual Resource Provisioning

N. Xiong@ GSU Slide 46

Autonomic Adaptation of Virtual Environments

N. Xiong@ GSU Slide 47

Some References and Further Reading

N. Xiong@ GSU Slide 48

Homework Problems

N. Xiong@ GSU Slide 49

Homework Problems