NWU and HPC

DESCRIPTION

Presentations on NWU's HPC strategy

TRANSCRIPT

Page 1: NWU and HPC

High Performance Computing

Attie Juyn & Wilhelm van Belkum

Page 2: NWU and HPC

Agenda

The birth of an HPC…

• Part A: management perspective

• Part B: technical perspective

Page 3: NWU and HPC

Background

• Various departmental compute clusters

• A flagship project at the CHPC

• Fragmented resources and effort

At last year’s conference, our vision was ….

Page 4: NWU and HPC

To establish an Institutional HPC

Level 1 (Entry Level): Personal workstation (10 GFlop)

Level 2: Departmental Compute Cluster (40-100 GFlop)

Level 3: Institutional HPC (1-2 TFlops)

Level 4: National/International HPC (> 2 TFlops)

Page 5: NWU and HPC

University Strategy

• Increased focus on research: to develop into a balanced teaching-learning & research university

• As a result of the merger, a central IT department

Page 6: NWU and HPC

The Challenge: to innovate

• Sustainability: HPC must be a service, not a project or experiment
– Funding model must enable constant renewal
– Support model with clear responsibilities

• Reliability: redundant design principles (DR capability)
– 24x7x365 (not 99.99%)

• Availability: standardised user interface (not root)
– Equally accessible on all campuses

• Efficiency: power, cooling, etc.

Page 7: NWU and HPC

HPC (IT) success criteria

Sustainability

Efficiency

Reliability

Availability

= key issues of this decade

& Performance

Page 8: NWU and HPC

Enabling factors

• A spirit of co-operation: key researchers & IT agreeing on what should be done

• A professional, experienced IT team, supporting ±200 servers in 4 distributed data centers

• A well managed, state-of-the-art infrastructure, resulting from the merger period

• Management trust & commitment

• International support & connections: networks, grids, robust & open software

Page 9: NWU and HPC

Project milestones

• March 2007: first discussions & documentation of vision

• April 2007: budget compilation and submission

• 27 November 2007: project and budget approved

• December 2007: CHPC Conference, tested our vision

• 17 March 2008: Dr Bruce Becker visits Potchefstroom (first discussions of gLite, international & SA grids)

• 18 March 2008: grid concept presented to IT Directors

• May 2008: established POC cluster, testing software

• June-October 2008: recruitment & training of staff

• July 2008: Grid Conference at UCT & SA Grid initiation

• August-September 2008: detailed planning & testing

• October 2008: tenders & ordered equipment

• Nov. 2008 - Jan. 2009: implementation

Page 10: NWU and HPC

Management principles

• A dedicated research facility (not for general computing)

• To serve researchers in approved research programmes of all three campuses

• Implemented, maintained and supported by Institutional IT (IT should do the IT)

• Configured to international standards & best practice (to be shown later)

• Parallel applications only

• Usage governed by an institutional and representative governance body

• Sustainability subject to acceptable ROI (to justify future budgets)

Page 11: NWU and HPC
Page 12: NWU and HPC

The New World Order

Mainframe

Mini Computer

PC

Cluster & Grids

Vector Supercomputer

Source: 2006 UC Regents

Page 13: NWU and HPC

Technical goals

Build an Institutional High Performance Computing facility, based on Beowulf cluster principles, coexisting with and linking the existing departmental clusters and the national and international computational grids.

Page 14: NWU and HPC

Beowulf cluster

• The term "Beowulf cluster" refers to a cluster of workstations (usually Intel architecture, but not necessarily) running some flavor of Linux that is utilized as a parallel computation resource.

• The main idea is to use commodity, off-the-shelf computing components with Open Source software to create a networked cluster of workstations.
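To make the idea concrete, here is a minimal sketch (added, not from the presentation) of the kind of parallel program such a commodity cluster runs, written against MPI, which appears later in the software landscape. It assumes an MPI implementation such as MPICH or LAM is installed on the nodes; the file name hello.c is illustrative.

/* hello.c: each MPI rank reports its identity and the node it runs on. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, name_len;
    char host[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);                  /* start the MPI runtime      */
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);    /* this process's rank        */
    MPI_Comm_size(MPI_COMM_WORLD, &size);    /* total number of processes  */
    MPI_Get_processor_name(host, &name_len); /* node this rank runs on     */

    printf("Rank %d of %d running on %s\n", rank, size, host);

    MPI_Finalize();
    return 0;
}

A typical build-and-run would be "mpicc hello.c -o hello" followed by "mpirun -np 16 ./hello"; the exact launch command depends on the MPI implementation and the cluster's workload manager.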

Page 15: NWU and HPC

History of Clusters - The first Beowulf

07/2002 – Design system

08/2002 to 11/2002 – Build system

03/2003 – System in Production

• 7-8 Months for Concept to Production

• Moore's Law: 18-month half-life of performance and cost (see the sketch below)

-> Useful life of 3-4 years

Source: 2006 UC Regents
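As a rough, added illustration of the half-life point above (the 18-month doubling period and the percentages below are assumptions, not figures from the slide), the relative price/performance of a fixed cluster against newly bought hardware can be tabulated:

/* halflife.c: relative price/performance of a fixed machine, assuming
 * performance per unit cost doubles every 18 months (compile with -lm). */
#include <math.h>
#include <stdio.h>

int main(void)
{
    const double doubling_months = 18.0;   /* assumed Moore's-law doubling period */

    for (int months = 0; months <= 48; months += 12) {
        double relative = pow(0.5, months / doubling_months);
        printf("After %2d months: %4.0f%% of current price/performance\n",
               months, relative * 100.0);
    }
    return 0;
}

After 36 months the machine delivers roughly 25% of current price/performance, and under 16% after 48 months, which is what bounds the useful life at 3-4 years.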

Page 16: NWU and HPC

The Evolved Cluster

[Diagram: the evolved cluster. User, Admin, Job Queue, Scheduler, Resource Manager, License Manager, Identity Manager, Allocation Manager and Compute Nodes on a Myrinet interconnect, with a Departmental Cluster (its own Resource Manager and Scheduler) attached.]

Source: Cluster Resources, Inc.

Page 17: NWU and HPC

Cluster and Grid software landscape

Page 18: NWU and HPC

[Diagram: the grid/cluster stack or framework, layered from bottom to top]

• Hardware (Cluster or SMP)

• Operating System: CentOS, Scientific Linux, RedHat, Solaris, UNICOS, AIX, HP-UX, Mac OS X, Windows, other

• Resource Manager / cluster framework: Rocks, Oscar, Torque

• Parallel applications (MPI, PVM, LAM, MPICH) and serial applications

• Cluster Workload Manager (scheduler, policy manager, integration platform): PBSpro, PBS, SGE, Condor(G), LSF, SLURM, LoadLeveler, MOAB, MAUI, Nimrod

• Grid Workload Manager (scheduler, policy manager, integration platform): GLOBUS (USA), gLite (EU/EGEE), CROWNGrid (Chinese), UNICORE

• User access for users and admins: Portal, CLI, GUI, Application

• Security (spanning all layers)

Page 19: NWU and HPC

Departmental Computer Cluster

Page 20: NWU and HPC

CHPC (May 2007)

"iQudu" (isiXhosa name for kudu), "Tshepe" (Sesotho name for springbok) and "Impala"

• 160-node Linux cluster

• Each node with 2x dual-core AMD Opteron 2.6GHz (Rev. F) processors and 16GB of RAM

• InfiniBand 10 Gb/s cluster interconnect

• 50TB SAN

• 640 processing cores (2.5 TFlops)

• 2x IBM p690 with 32 x 1.9GHz Power4+ CPUs

• 32GB of RAM each

Page 21: NWU and HPC

The #1 and #13 in the world (2007)

• BlueGene/L - eServer Blue Gene Solution (IBM, 212,992 Power cores), DOE/NNSA/LLNL - USA: 478.2 trillion floating-point operations per second (teraFLOPS) on LINPACK

• MareNostrum - BladeCenter JS21 Cluster, PPC 970 2.3 GHz, Myrinet (IBM, 10,240 Power cores), Barcelona Supercomputer Centre - Spain: 63.83 teraFLOPS

By November 2008 these machines had slipped to #4 and #40 in the world.

Page 22: NWU and HPC

As of November 2008, #1: Roadrunner

Roadrunner - BladeCenter QS22/LS21 Cluster, 12,240x PowerXCell 8i 3.2 GHz, 6,562x dual-core Opteron 1.8 GHz

DOE/NNSA/LANL - United States

1.105 PetaFlops

Page 23: NWU and HPC

Reliability & Availability of HPC

Page 24: NWU and HPC

HPC (IT) success criteria

Sustainability

Efficiency

Reliability

Availability

= key issues of this decade

& Performance

Page 25: NWU and HPC

Introducing - Utility Computing

[Diagram: utility computing. A Grid Workload Manager (Condor, MOAB) spans the Data Center and the HPC, each with its own Resource Manager (RM).]

• Swapping & migration of hardware (first phase)

• Dynamic load shifting at RM level (second phase)

Page 26: NWU and HPC

[Diagram: the grid/cluster software stack from Page 18, repeated: hardware, operating systems, resource managers (Rocks, Oscar, Torque), parallel & serial applications (MPI, PVM, LAM, MPICH), cluster workload managers (PBSpro, PBS, SGE, Condor(G), LSF, SLURM, LoadLeveler, MOAB, MAUI, Nimrod), grid workload managers (GLOBUS, gLite, CROWNGrid, UNICORE), user access (Portal, CLI, GUI, Application) and security spanning all layers.]

Page 27: NWU and HPC

[Diagram: NWU HPC building blocks]

• HP BL460c: 8x 3GHz Xeon, 12MB L2, 1333MHz FSB, 10GB memory (96 GFlop)

• HP BL2x220c: 16x 3GHz Xeon (192 GFlop)

• HP C7000 enclosure: up to 16x BL460c (1.536 TFlop) or 16x BL2x220c (3.072 TFlop)

• HP Modular Cooling System G2: up to 4x HP C7000 (512 CPU cores / 5.12 TFlop with BL460c; 1024 CPU cores / 12.288 TFlop with BL2x220c)

• HP BLc Virtual Connect Ethernet

• D-Link xStack DSN-3200: 10.5TB RAID5, 80,000 I/O per second

Page 28: NWU and HPC

HP ProLiant BL460c

BL460c

Processor: up to two dual- or quad-core Intel Xeon processors

Memory: FBDIMM 667MHz, 8 DIMM slots, 32GB max

Internal storage: 2 hot-plug SFF SAS HDDs, standard RAID 0/1 controller with optional BBWC

Networking: 2 integrated multifunction Gigabit NICs

Mezzanine slots: 2 mezzanine expansion slots

Management: Integrated Lights-Out 2 Standard Blade Edition

Page 29: NWU and HPC

BL460c Internal View

[Diagram callouts:]

• Embedded Smart Array controller integrated on the drive backplane

• 8 fully buffered DIMM slots, DDR2 667MHz

• Two mezzanine slots: one x4, one x8

• Two hot-plug SAS/SATA drive bays

• QLogic QMH2462 2-port 4Gb FC HBA

• NC512m 2-port 10GbE-KX4 (NetXen)

• 2-port 4X DDR (20Gb) InfiniBand (Mellanox)

Page 30: NWU and HPC

HP ProLiant BL2x220c G5

BL2x220c G5

Processor: up to two dual- or quad-core Intel Xeon processors per board

Memory: registered DDR2 (533/667 MHz), 4 DIMM sockets per board, 16GB max (with 4GB DIMMs)

Internal storage: 1 non-hot-plug SFF SATA HDD per board

Networking: 2 integrated Gigabit NICs per board

Mezzanine slots: 1 PCIe mezzanine expansion slot (x8, Type I) per board

Management: Integrated Lights-Out 2 Standard Blade Edition

Density: 32 server blades in a 10U enclosure, 16 server blades in a 6U enclosure (2 blades per half-height enclosure bay)

Page 31: NWU and HPC

HP ProLiant BL2x220c G5

Internal View

[Diagram callouts:]

• Two mezzanine slots, both x8 (both reside on the bottom board)

• 2x optional SATA HDDs

• Top and bottom PCA, side by side

• 2x 2 CPUs

• 2x 4 DIMM slots, DDR2 533/667MHz

• 2x embedded 1Gb Ethernet dual-port NICs

• Server board connectors

Page 32: NWU and HPC

Servers and other racked equipment

• Half-height blade server: up to 16 per 10U enclosure

• Max. capacity: HP Modular Cooling System G2, up to 4x HP C7000, 1024 CPU cores, 12.288 TFlop

Page 33: NWU and HPC

NWU HPC Hardware Spec.

• 16 nodes, each with dual quad-core Intel Xeon E5450
– 3GHz CPU, 12MB L2, 1333MHz FSB, 80W power
– 16x HP BL460c
– 10GB memory per node
– HP c7000 enclosure
– HP Modular Cooling System G2 (MCS G2)
– D-Link iSCSI DSN-3200 (20TB disk)

• 16 nodes, each with dual quad-core Intel Xeon E5450
– 3GHz CPU, 12MB L2, 1333MHz FSB, 80W power
– 8x HP BL2x220c
– 10GB memory per node
– HP c7000 enclosure
– HP Modular Cooling System G2 (MCS G2)
– D-Link iSCSI DSN-3200 (20TB disk)

• 32 * 8 * 3GHz * 4 = 3.072 TFlops (256 cores), worked out in the sketch below

• 32 * 10 GByte = 320 GB memory

• 2 * 10 TByte storage

• Gigabit Ethernet interconnect: 42.23 microseconds latency (IB = 4 microseconds)
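The 3.072 TFlops figure above is nodes x cores per node x clock x FLOPs per cycle. A small added sketch reproduces it; the 4 FLOPs per cycle corresponds to the "* 4" factor on the slide and is assumed here to come from the Xeon E5450's SSE units.

/* peak.c: theoretical peak of the NWU cluster as quoted on the slide. */
#include <stdio.h>

int main(void)
{
    const int    nodes           = 32;   /* 16x BL460c + 8x BL2x220c (2 boards each) */
    const int    cores_per_node  = 8;    /* dual quad-core Xeon E5450                */
    const double clock_ghz       = 3.0;
    const double flops_per_cycle = 4.0;  /* assumed: SSE multiply + add per cycle    */

    double peak_gflops = nodes * cores_per_node * clock_ghz * flops_per_cycle;
    printf("Theoretical peak: %.3f TFlops over %d cores\n",
           peak_gflops / 1000.0, nodes * cores_per_node);   /* 3.072 TFlops, 256 cores */
    return 0;
}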

Page 34: NWU and HPC

NWU HPC/Grid

Campus GRID

Page 35: NWU and HPC

University Wide Area Network/Internet

Total of 45 Mbps (34.2 Mbps international)

Telkom
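To put the WAN figures in perspective, here is an added back-of-the-envelope calculation (the 1 TB dataset size and the 1 Gbps comparison link are purely hypothetical) of how long a bulk research data transfer takes at these speeds:

/* wan.c: naive transfer-time estimate, ignoring protocol overhead. */
#include <stdio.h>

static double hours_to_move(double terabytes, double link_mbps)
{
    double bits = terabytes * 1e12 * 8.0;       /* TB -> bits        */
    return bits / (link_mbps * 1e6) / 3600.0;   /* seconds -> hours  */
}

int main(void)
{
    const double dataset_tb = 1.0;              /* hypothetical dataset size */
    printf("1 TB over 45 Mbps  : %5.1f hours\n", hours_to_move(dataset_tb, 45.0));
    printf("1 TB over 1000 Mbps: %5.1f hours\n", hours_to_move(dataset_tb, 1000.0));
    return 0;
}

At 45 Mbps a single terabyte takes roughly two days to move, which helps explain the interest in SANREN and SEACOM on the slides that follow.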

Page 36: NWU and HPC

SANREN

SANREN Vision and the Players

InfraCo

SEACOM

Page 37: NWU and HPC

SA-Grid

CHPC

NWUC4

UOVS

SA-Grid

Page 38: NWU and HPC

SEACOM

TE-North is a new cable currently being laid across the Mediterranean Sea.

• Cable laying to start Oct. 08

• Final splicing April 09

• Service launch June 09

Page 39: NWU and HPC

International Grid

Page 40: NWU and HPC

High Performance Computing @ NWU

[Project timeline, January 2008 - December 2009:]

• 28 November 2008: HPC

• 15 December 2008

• 30 June 2009: Campus GRID

• 29 November 2009: National GRID

• 20 December 2009: International GRID

Page 41: NWU and HPC

High Performance Computing

Scientific Linux

North-West University

Sustainable, Efficient, Reliable, High Availability & Performance

@ > 3 TFlops

Page 42: NWU and HPC